The object of this grant is the analysis and design of decision procedures
that have stable, good performance in statistically ill-defined environments.
Such procedures indicate how to design powerful receivers for systems
whose statistical behavior cannot be described precisely (because data
about the system behavior are incompletely available).

Within this framework, the following progress has already been made:

1. Different distance measures have been studied for use as
performance criteria for robust estimates. These distances were carefully
evaluated and compared, and their similarities, advantages, and disadvantages
were stated. It was observed that some of these distances are more
naturally related to the estimation problem and that, in cases in which they
are equivalent, the designer may use the one that is computationally or
structurally more convenient. The Vasershtein distance was proposed
and used as a stability measure for estimates in statistically contaminated
environments. This distance is naturally related to the commonly used performance
measures in parameter estimation. Through the use of the Vasershtein distance.


2. A thorough study of the work already accomplished (by the author
as well as other investigators) on nonparametric statistical procedures in
the presence of a small number of discrete data was completed and included in a
book on the use of nonparametric procedures in communication systems.

3. A feature selection problem was studied in which several distance
measures are used as discrimination criteria. This led to a better
understanding of the qualities of the distances. It was found that the
feature extraction algorithm is sometimes independent of the criterion.
This allows the maintenance of a single feature construction mechanism that
works equally well for several systems with different specifications. In
this sense the feature selection algorithm is robust.

4. A sequential procedure for clustered data was proposed and analyzed.
This procedure applies to several stages of statistical information about the
system, and it differs from the known procedures in that data collection
costs are included and the data clusters considered are finite in number.
The results are therefore nonasymptotic, and they apply to any problem in which
the data are collected sequentially in clusters and there is a preassigned
maximum number of such clusters available. The results have been tested
numerically for some systems with given specifications.

5. Hampel's general qualitative definition of robustness of sequences
of estimators on memoryless observation processes was generalized to stationary
processes. Structural properties of the estimates were found in this case,
and based on these properties the design of robust estimates that operate on
dependent data is now in progress.


6. The constructive analysis of robustness completed by the author is
now being used in the performance analysis of communication networks at
Bell Laboratories.

7. The discrimination of Gaussian processes has been studied and
computationally efficient methods have been found. These methods also lead to
efficient discrimination of contaminated Gaussian processes.

In  twelve  months,  one  Ph.D.  thesis  and  one  book  have  been  partially 
supported  by  this  grant,  three  papers  have  been  submitted  to  journals,  four 
conference  presentations  have  been  made,  two  University  and  three  Bell 
Laboratories  reports  have  been  produced.  Finally,  two  seminars  at  Bell 
Telephone  Laboratories  have  been  presented. 

In  what  follows,  a list  of  publications  supported  by  this  grant,  and 
some  of  the  work  accomplished  that  is  not  included  in  the  semiannual  report 
dated  May  6,  1976,  are  presented. 
1. R.Y.S. Li, "Methods for Data Reduction," May 1977.

Books  Published: 

1. P. Papantoni-Kazakos and D. Kazakos, editors and contributors,
"Nonparametric Methods in Communications. Selected Topics,"
Marcel Dekker Inc., New York, 1977.

Papers  Submitted  to  Journals: 

1. P. Papantoni-Kazakos, "Some Distance Measures and Their Use in a
Feature Selection Problem."

2. P. Papantoni-Kazakos and R. M. Gray, "Robustness of Estimators on
Stationary Observation."

3. D. Kazakos and P. Papantoni-Kazakos, "Asymptotic Discrimination of
Gaussian Processes."

Conference  Presentations: 

1. P. Papantoni-Kazakos, "Some Distance Measures and Their Use in Feature
Selection," Eleventh Annual Conference on Information Sciences and
Systems, The Johns Hopkins University.

2. P. Papantoni-Kazakos, D. Kazakos and R. Li, "A Kalman Filtering
Formulation for the Linear Reduction of Gauss-Markov Data," Eleventh
Annual Conference on Information Sciences and Systems, The Johns
Hopkins University.

3. D. Kazakos and P. Papantoni-Kazakos, "Robust Rate Distortion,"
International Symposium on Information Theory, 1977.

4. P. Papantoni-Kazakos, "Some Problems in Communication Networks,"
Fifteenth Annual Allerton Conference on Circuit and System Theory, 1977.


University Reports:

1. P. Papantoni-Kazakos, "Some Distance Measures and Their Use in
Feature Selection," Rice University E.E. Technical Report #7611,
November 1976.

2. P. Papantoni-Kazakos, "Some New Performance Criteria in Robust
Statistics - Small Sample Robustness," Technical Report #7701,
January 1977.


Bell  Laboratories  Technical  Memoranda: 

1. P. Papantoni-Kazakos, "Some Distance Measures and Their Use in
Feature Selection," TM-77-3452-5, July 12, 1977.

2. P. Papantoni-Kazakos, "Some Performance Criteria Incorporating
Data Dependence in Robust Estimation," TM-77-3452-4, July 12, 1977.

3. P. Papantoni-Kazakos and R. M. Gray, "Robustness of Estimation on
Stationary Observations," TM-77-3452-7, September 20, 1977.


Seminars  Presented: 

1. P. Papantoni-Kazakos, "The Vasershtein Distance in the Constructive
Analysis of Robust Estimates," Bell Telephone Laboratories, Holmdel,
New Jersey, April 1977.

2. P. Papantoni-Kazakos, "Robust Estimators on Stationary Observations,"
Bell Telephone Laboratories, Holmdel, New Jersey, October 1977.


Comments  on  the  Accomplished  Work  From  Scientists  in  the  Field 


The constructive analysis of robustness with the use of a Vasershtein
stability criterion has been regarded, by people at Stanford University and
Bell Telephone Laboratories with whom I talked, as incorporating more naturally
the proper performance criteria in parameter estimation. Also, the
extension of the analysis to data evolving from general stationary processes
(rather than just processes with independent data) has been considered important
for the understanding of robust estimates in the presence of dependent data
structures.

The study and evaluation of different distance measures and their
applications to the feature selection problem has been considered valuable
by attendees of the 1977 Johns Hopkins Conference. The different distance
measures are used as different discriminant measures, each representing a
different class of problems. Their uniform evaluation and comparison, which
had not been done before, and the analysis of their value to the feature
extraction problem have been considered a useful contribution.

The sequential decision scheme included in the thesis enclosed here, and
more particularly its version for two distinct nonparametric classes, has been
considered very valuable by scientists in pattern recognition. Its use allows
data savings as well as good performance for discrimination between two
statistically ill-defined data classes.

Workshops  Attended: 

1. 1977 Communications Workshop, Tucson, Arizona, April 1977.
ABSTRACT


A general discussion is presented on some of the open problems
in communication networks. Routing structures and causes of unsuccessful
communication through the network are emphasized. Some open problems
involving sophisticated parametric as well as robust statistical
algorithms are stated.
Description of the Network

To understand some of the problems involved in reliably communicating
messages within the network, some basic network operations must be described.

The smallest element of a communication network that is of any interest to
the network analyst is a center. A center consists of several units
that communicate directly with each other. Different centers communicate
through a number of routes, and the route that carries a particular message
is chosen hierarchically. Each route consists of a number of
links that are, in general, connected to each other through tandems (switching
offices). Finally, each link consists of a number of single message carriers
that are called trunks, while the tandems connect several centers. A message
originating at center A (figure 1) and destined for another center B
follows the routing hierarchy described below.

The message first tries the direct route, which consists of a single link
connecting the two centers (dotted line in figure 1). If all the trunks in
this link are functioning properly but are busy, the message tries the next
route in the hierarchy (the route through A, T_1, B in figure 1). If this route
is also functioning well but busy, the message tries the next route in the
hierarchy, and so on, until it reaches the final route available to it (route
A, T_1, T_2, B in figure 1). If this last route is busy or malfunctioning, the
message fails to go through and a communication failure to B is recorded
at A.
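The routing hierarchy just described can be sketched in a few lines of code. This is only an illustration under stated assumptions: the state names "free", "busy" and "faulty" and the function `try_routes` are hypothetical, and the treatment of a faulty intermediate route (the message is lost there) is an assumption that anticipates the case of a center unaware of the malfunction.

```python
from typing import List, Optional

def try_routes(route_states: List[str]) -> Optional[int]:
    """Return the index of the route that carries the message, or None on failure.

    route_states lists the routes in hierarchical order, each marked
    "free", "busy", or "faulty" (hypothetical labels for this sketch).
    """
    for index, state in enumerate(route_states):
        if state == "free":
            return index          # the message goes through on this route
        if state == "faulty":
            return None           # assumption: a malfunctioning route loses the message
        # state == "busy": all trunks occupied, overflow to the next route
    return None                   # every route was busy: communication failure

# Direct route busy, first alternate route free.
carried_on = try_routes(["busy", "free", "free"])
```

A return value of `None` corresponds to a communication failure recorded at the originating center.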
The rejection of a message by a particular route due to full occupancy
(at the moment) of all trunks involved is called blocking. Under healthy
network conditions, blocking probabilities can be assigned to each route
that corresponds to a particular center pair (A, B); they are functions of the
A to B communication load, the number of routes connecting A and B, and the
number of trunks in each such route.
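As an illustration of how a blocking probability depends on load and number of trunks, the classical Erlang B formula can be used. It is not taken from this report: it assumes Poisson arrivals and blocked messages that are cleared, so for the alternate routes (which carry overflow traffic) it is at best a rough approximation. The function name `erlang_b` is hypothetical.

```python
def erlang_b(offered_load: float, trunks: int) -> float:
    """Blocking probability of a trunk group under the Erlang B model.

    offered_load is in erlangs. Uses the standard stable recursion
    B(0) = 1, B(m) = a * B(m-1) / (m + a * B(m-1)).
    """
    blocking = 1.0
    for m in range(1, trunks + 1):
        blocking = offered_load * blocking / (m + offered_load * blocking)
    return blocking

# Blocking of a direct link with 2 trunks offered 2 erlangs of A to B traffic.
direct_blocking = erlang_b(2.0, 2)
```

Adding trunks to a link at fixed load lowers its blocking probability, which is the trade-off between economy and communication efficiency that the routing structure has to balance.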

Suppose now that an "average load" time period is considered and the
communication from center A to center B is studied. If in some of the routes
between A and B a link is malfunctioning (due to some faulty trunk), and
if center A is unaware of the malfunction, messages from A to B will keep
trying this link with the probability specified by the initial routing structure
and the "average load". As a result, some messages will be killed
by the malfunctioning link and communication failures from A to B will be
recorded. Therefore, in the presence of faulty links of which center A is
unaware, communication failures will occur that are not just due to
overload and do not happen just at the highest route in the hierarchy.
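A small calculation makes the effect concrete. The sketch below assumes that the routes block independently with fixed probabilities during the "average load" period and that a faulty route kills a fixed fraction of the messages that try it; both assumptions, the parameter `kill_fraction`, and the function name are hypothetical simplifications, not the model of the memoranda.

```python
from typing import Optional, Sequence

def failure_probability(blocking: Sequence[float],
                        faulty_route: Optional[int] = None,
                        kill_fraction: float = 1.0) -> float:
    """Probability that an A to B message fails.

    blocking[i] is the blocking probability of route i, in hierarchical order.
    A message reaches route i only if every earlier route was busy.
    """
    reach = 1.0   # probability that the message is still trying routes
    lost = 0.0    # probability that it was killed by the faulty route
    for i, b in enumerate(blocking):
        if i == faulty_route:
            lost += reach * kill_fraction
            reach *= 1.0 - kill_fraction
        reach *= b                # survivors overflow only if this route is busy
    return lost + reach

healthy = failure_probability([0.5, 0.2, 0.1])
with_fault = failure_probability([0.5, 0.2, 0.1], faulty_route=1)
```

With these illustrative numbers, the healthy network fails only when all three routes are busy, while a fault on the second route makes every message that overflows the direct link fail.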

The routing structure described above is based on a trade-off between
economy and communication efficiency. The direct links (dotted line in
figure 1) usually carry the highest portion of the message load, while the
alternate routes higher in the hierarchy are used during traffic peaks; they
have capacity high enough to secure good communication when such peaks
occur and low enough that they do not remain idle most of the time.
The performance of the network, as viewed by the users, is measured
through its ability to successfully respond to communication attempts.
Its efficiency as viewed by an outside observer is a combination of two
factors: effectiveness in responding to communication demands, and average
degree of occupancy.

2.  Some  Open  Problems 

We concentrate here on the performance evaluation of the network.
The following major question arises in this case:

Is it possible to evaluate the network performance at a particular
time? If yes, what kind of data are required, and how can such an evaluation
be effective without utilizing an excessive amount of information? Also,
how can malfunctions be localized or even predicted with the use of economically
attractive methods?

In two Bell Labs technical memoranda that have not yet been cleared for
publication, the author analyzes the use of limited center-to-center
successful and unsuccessful communication completion data for locating the
faulty links that cause the failures, and for continuously monitoring the
quality of communication throughout the network. The algorithms used are
economical not only because they utilize only a limited amount of information,
but also because they require one-step memory and are computationally efficient.

However, assumptions as to the routing structures have been made that in
some cases need relaxation. Specifically, "average load" time periods are
observed and the routing probabilities are then considered unchanged. But
even during such time periods the load fluctuations may be momentarily
substantial, in which case "robust" algorithms that are mostly insensitive
to such load variations must be developed.

Furthermore, the effect of malfunctioning network links on the relationship
between communication messages passing through them must be studied further.
Specifically, such links may cause "partial message killing" as well as
interference between messages. These effects result in an additional reduction
of the communication quality within the network.
Asymptotic Discrimination of Gaussian Processes


Dimitri Kazakos* and Titsa Papantoni-Kazakos**
State University of New York at Buffalo


Bell Laboratories


ABSTRACT 


We present a theory on the asymptotic approximation of block Toeplitz
matrices by block circulant ones. The method is then applied to the
calculation of the asymptotic Bhattacharyya distance $B = \lim_{n\to\infty} n^{-1}B_n$
and divergence $J = \lim_{n\to\infty} n^{-1}J_n$ between two stationary vector
Gaussian processes, in terms of the two spectral density matrices
$F_1(\lambda)$, $F_2(\lambda)$. Specifically,

$$2B = (2\pi)^{-1}\int_0^{2\pi} \log\{\,|2^{-1}F_1(\lambda) + 2^{-1}F_2(\lambda)| \cdot |F_1(\lambda)|^{-1/2}\,|F_2(\lambda)|^{-1/2}\}\,d\lambda$$

$$2J = (2\pi)^{-1}\int_0^{2\pi} \mathrm{trace}\,[F_1(\lambda)F_2^{-1}(\lambda) + F_2(\lambda)F_1^{-1}(\lambda) - 2I]\,d\lambda$$

The above expressions are useful because of existing upper and lower
bounds to the Bayes error of misclassification. Furthermore, they can
be considered as distance measures in their own right. The availability
of efficient spectral estimation techniques renders them most useful.


* Research  supported  by  NSF  Grant  ENG  76  20295. 

**  Research  supported  by  Air  Force  Grant  AFOSR  77-3156. 


I . INTRODUCTION 


It is well known that the Bayes decision rule is the optimal one
in deciding between two statistical hypotheses with known prior probabilities
and conditional probability density functions. One of the most common
statistical models for data is the Gaussian random process. In assessing
the performance of the statistical classifier using the Bayes decision
rule, one is faced with the difficult task of evaluating its performance
through the available expression for the probability of misclassification,

$$P_e = \int \min[\pi_1 f_1(x^n),\ \pi_2 f_2(x^n)]\,dx^n \qquad (1)$$

where $x^n = [x_1 \ldots x_n]^T$, $x_i \in R^k$, $f_1(x^n)$, $f_2(x^n)$ are the
conditional p.d.f.'s, and $\pi_1$, $\pi_2$ are the prior probabilities of
hypotheses $H_1$, $H_2$. Clearly, numerical integration
techniques have to be used. Due to the high dimension of the integration
region, numerical techniques are costly, and they do not provide
understanding of the influence on $P_e$ of several parameters of interest in
$f_1$, $f_2$. For example, if one wishes to reduce the data by some feature
selection technique, the expression (1) is not useful in choosing the optimal
transformation of $x^n$. Also, (1) does not provide any feeling as to the
incremental reduction of $P_e$ as n grows.

The  following  pair  of  bounds  to  Pe  is  known:  [ 1 ] - [ 4 ] 


-1  1/2 
2 exp{-2B  } < Pe  < (n.'n,)  exp{-B  ) 

1 z n X / n 


$$8^{-1}\exp\{-2^{-1}J_n\} \le P_e \qquad (3)$$

where:

$$B_n = -\log \int [f_1(x^n) f_2(x^n)]^{1/2}\,dx^n \qquad (4)$$

$$J_n = \int [f_1(x^n) - f_2(x^n)]\,\log[f_1(x^n) f_2^{-1}(x^n)]\,dx^n \qquad (5)$$

A lower bound tighter than (3) has been developed in [5]. It has
been shown in [6] that no upper bound to $P_e$ in terms of $J_n$ exists. In
the present paper, we will develop asymptotic expressions for

$$B = \lim_{n\to\infty} n^{-1}B_n, \qquad J = \lim_{n\to\infty} n^{-1}J_n \qquad (6)$$

in terms of the spectral density matrices $F_1(\lambda)$, $F_2(\lambda)$. The motivation
lies in the fact that the spectral densities are among the first
characteristics of a process to be measured, and very efficient spectral estimation
techniques are available in the statistical literature [7]-[10].
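To give a feeling for how the quantities (4) and (5) bracket the error probability (1), the sketch below evaluates them in the simplest possible case: two zero-mean scalar Gaussian densities (n = 1, k = 1) with equal priors. The closed forms for $B_n$ and $J_n$ are the standard ones for zero-mean Gaussians, the upper bound is the Bhattacharyya bound $(\pi_1\pi_2)^{1/2}\exp\{-B_n\}$, and the lower bound is (3). The variances, grid, and function names are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, var):
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def scalar_distances(var1, var2):
    """B_n of (4) and J_n of (5) for zero-mean scalar Gaussians."""
    b = 0.5 * np.log(0.5 * (var1 + var2)) - 0.25 * np.log(var1) - 0.25 * np.log(var2)
    j = 0.5 * (var1 / var2 + var2 / var1 - 2.0)
    return float(b), float(j)

def bayes_error(var1, var2, prior1=0.5, half_width=60.0, points=400001):
    """Riemann-sum evaluation of (1) on a fine grid."""
    x = np.linspace(-half_width, half_width, points)
    integrand = np.minimum(prior1 * gaussian_pdf(x, var1),
                           (1.0 - prior1) * gaussian_pdf(x, var2))
    return float(np.sum(integrand) * (x[1] - x[0]))

var1, var2 = 1.0, 4.0                      # illustrative variances
b_n, j_n = scalar_distances(var1, var2)
p_e = bayes_error(var1, var2)
upper = 0.5 * np.exp(-b_n)                 # (pi_1 pi_2)^{1/2} exp(-B_n), equal priors
lower = np.exp(-0.5 * j_n) / 8.0           # bound (3)
```

In one dimension the integral (1) is cheap; the point of the bounds is that $B_n$ and $J_n$ remain computable when the dimension of $x^n$ makes direct integration impractical.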

The technique to be used is, we think, interesting in itself and
useful in other applications. It is based on the asymptotic approximation
of block Toeplitz matrices by block circulant ones. A similar technique
was used in [11]-[13] for the evaluation of rate-distortion functions.

The  distance  measures  derived  have  several  potential  applications, 
which  are  discussed  briefly  in  the  section  that  follows.  Those  are: 

(a)  Feature  selection  in  high-dimensional  observation  spaces 
in  pattern  recognition. 

(b)  Clustering  algorithms  for  the  same  situations  as  in  (a). 

(c)  Reduction  of  remote  sensing  data. 

(d)  Speech  processing. 

(e) Biomedical EEG signal analysis.

(f)  Tone  Detection  in  Telephone  Networks. 
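II. ASYMPTOTIC APPROXIMATIONS


A block Toeplitz $kn \times kn$ matrix $R_n$ has the form:

$$R_n = \begin{bmatrix} R_0 & R_{-1} & \cdots & R_{-(n-1)} \\ R_1 & R_0 & \cdots & R_{-(n-2)} \\ \vdots & \vdots & \ddots & \vdots \\ R_{n-1} & R_{n-2} & \cdots & R_0 \end{bmatrix} \qquad (7)$$

where $R_i$ are $k \times k$ matrices. A block circulant $kn \times kn$ matrix $C_n$ has the form

$$C_n = \begin{bmatrix} C_0 & C_1 & \cdots & C_{n-1} \\ C_{n-1} & C_0 & \cdots & C_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ C_1 & C_2 & \cdots & C_0 \end{bmatrix} \qquad (8)$$

where $C_i$ are $k \times k$ matrices.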


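Consider the $kn \times kn$ matrix

$$V_n = n^{-1/2}\begin{bmatrix} I_k & I_k & I_k & \cdots & I_k \\ I_k & wI_k & w^2I_k & \cdots & w^{n-1}I_k \\ I_k & w^2I_k & w^4I_k & \cdots & w^{2(n-1)}I_k \\ \vdots & \vdots & \vdots & & \vdots \\ I_k & w^{n-1}I_k & w^{2(n-1)}I_k & \cdots & w^{(n-1)^2}I_k \end{bmatrix} \qquad (9)$$

where $w = \exp(i2\pi n^{-1})$.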

It can be easily shown that $V_n$ is a unitary matrix, i.e.

$$V_n^{-1} = V_n^{*} \qquad (10)$$

Consider now the matrix

$$\hat{C}_n = V_n^{-1} C_n V_n = V_n^{*} C_n V_n \qquad (11)$$


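Performing the multiplication (11), we observe that $\hat{C}_n$ is block diagonal:

$$\hat{C}_n = \mathrm{diag}\,[\,C(0),\ C(2\pi n^{-1}),\ \ldots,\ C(2\pi n^{-1}(n-1))\,] \qquad (12)$$

where $C(u)$ is a $k \times k$ matrix function defined as:

$$C(u) = \sum_{m=0}^{n-1} C_m \exp(-imu) \qquad (13)$$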


Consider now the problem of finding the eigenvalues of $C_n$. They are
solutions of the equation in x:

$$0 = |C_n - xI_{kn}| = |V_n^{-1}(C_n - xI_{kn})V_n| = |\hat{C}_n - xI_{kn}| = \prod_{m=0}^{n-1} |C(2\pi m n^{-1}) - xI_k| \qquad (14)$$


Thus, the kn eigenvalues of $C_n$ are identical to the union of the n sets of
eigenvalues of the matrices $\{C(2\pi m n^{-1}),\ m = 0, \ldots, n-1\}$. Let us
now define the weak norm of an $s \times s$ matrix $A = \{a_{ij}\}$:

$$|A| = \Big[s^{-1}\sum_{i=1}^{s}\sum_{j=1}^{s} |a_{ij}|^2\Big]^{1/2} = \Big[s^{-1}\sum_{i=1}^{s} |q_i|^2\Big]^{1/2} \qquad (15)$$

where $(q_1 \ldots q_s)$ are the eigenvalues of A. Also, we define the
strong norm $\|A\|$ as:

$$\|A\| = \max_i |q_i| \qquad (16)$$
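The block diagonalization (11)-(14) and the norms (15), (16) can be checked numerically. The sketch below is an illustration, not part of the original derivation: it builds a small symmetric block circulant matrix from randomly chosen blocks (the sizes n = 5, k = 2 and the random seed are arbitrary), forms $V_n$ as in (9), and compares the eigenvalues of $C_n$ with the union of the eigenvalues of $C(2\pi m n^{-1})$. The ordering of the diagonal blocks depends on the indexing convention chosen for the circulant, but the union of eigenvalues does not.

```python
import numpy as np

def block_circulant(blocks):
    """Assemble the kn x kn matrix whose (j, l) block is blocks[(l - j) mod n]."""
    n, k = len(blocks), blocks[0].shape[0]
    c = np.zeros((k * n, k * n))
    for j in range(n):
        for l in range(n):
            c[j * k:(j + 1) * k, l * k:(l + 1) * k] = blocks[(l - j) % n]
    return c

def block_fourier(n, k):
    """V_n of (9): block (j, l) equals n^{-1/2} w^{jl} I_k, w = exp(i 2 pi / n)."""
    idx = np.arange(n)
    f = np.exp(2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)
    return np.kron(f, np.eye(k))

def symbol(blocks, u):
    """C(u) of (13): sum over m of C_m exp(-i m u)."""
    return sum(c_m * np.exp(-1j * m * u) for m, c_m in enumerate(blocks))

def weak_norm(a):
    return np.linalg.norm(a, "fro") / np.sqrt(a.shape[0])   # first form of (15)

def strong_norm(a):
    return np.max(np.abs(np.linalg.eigvalsh(a)))            # (16), Hermitian a

rng = np.random.default_rng(0)
n, k = 5, 2
c1, c2 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
c0 = rng.standard_normal((k, k))
c0 = c0 + c0.T
blocks = [c0, c1, c2, c2.T, c1.T]        # C_{n-m} = C_m^T makes C_n symmetric
c_n = block_circulant(blocks)
v_n = block_fourier(n, k)
c_hat = v_n.conj().T @ c_n @ v_n         # (11)

eig_direct = np.sort(np.linalg.eigvalsh(c_n))
eig_union = np.sort(np.concatenate(
    [np.linalg.eigvalsh(symbol(blocks, 2 * np.pi * m / n)) for m in range(n)]))
```

The practical point is that n eigenproblems of size k replace one eigenproblem of size kn.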


If $a_{ij}$ are $k \times k$ matrices and A is $ks \times ks$, we still have:

$$|A|^2 = s^{-1}\sum_{i=1}^{s}\sum_{j=1}^{s} |a_{ij}|^2 \qquad (17)$$


Let $\{A_n\}$, $\{B_n\}$ be two sequences of Hermitian matrices. We say that
they exhibit "mutual approximation", denoted by $A_n \sim B_n$, if:

a) $\|A_n\|$, $\|B_n\|$, $|A_n|$, $|B_n|$ are all bounded from above by a
finite number M independent of n.

b) $$\lim_{n\to\infty} |A_n - B_n| = 0 \qquad (18)$$


Let $\{a_k^{(n)}\}$, $\{b_k^{(n)}\}$ be the eigenvalues of $A_n$, $B_n$ correspondingly.
We say that the sets $\{a_k^{(n)}\}$, $\{b_k^{(n)}\}$ are "asymptotically equally
distributed" in the interval $[-M, M]$ if

$$|a_k^{(n)}| \le M, \quad |b_k^{(n)}| \le M, \quad \forall\, k, n$$

and for any continuous function $f(\cdot)$ on $[-M, M]$ we have

$$\lim_{n\to\infty} n^{-1}\sum_{k=1}^{n} [f(a_k^{(n)}) - f(b_k^{(n)})] = 0 \qquad (19)$$

The following theorem of Grenander and Szego [14] will be used.

Theorem 1: Let $\{A_n\}$, $\{B_n\}$ be two sequences of Hermitian matrices
with eigenvalues $\{a_k^{(n)}\}$, $\{b_k^{(n)}\}$. If $A_n \sim B_n$, and either
$\lim_{n\to\infty}|A_n|$ or $\lim_{n\to\infty}|B_n|$ exists, then $\{a_k^{(n)}\}$,
$\{b_k^{(n)}\}$ are asymptotically equally distributed.
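A scalar (k = 1) experiment illustrates the definitions (18), (19) and Theorem 1. The sketch below assumes a hypothetical covariance sequence $r_k = a^{|k|}$ and the simplest circulant obtained by wrapping the covariances around, with first row $r_k + r_{n-k}$; this is not the approximation used later in the text, only a convenient stand-in. It reports the weak norm of the difference and the gap between the eigenvalue averages of $f = \log$ for growing n.

```python
import numpy as np

def toeplitz_and_circulant(n, a=0.6):
    """Toeplitz covariance r_{|i-j|} = a^{|i-j|} and a wrapped circulant version."""
    lags = np.arange(n)
    r = a ** lags
    t_n = r[np.abs(lags[:, None] - lags[None, :])]
    first_row = r.copy()
    first_row[1:] += r[:0:-1]                        # r_k + r_{n-k}, k = 1..n-1
    c_n = first_row[(lags[None, :] - lags[:, None]) % n]
    return t_n, c_n

def weak_norm(a):
    return np.linalg.norm(a, "fro") / np.sqrt(a.shape[0])

def eigenvalue_mean(f, a):
    return float(np.mean(f(np.linalg.eigvalsh(a))))

results = {}
for n in (16, 64, 256):
    t_n, c_n = toeplitz_and_circulant(n)
    gap = abs(eigenvalue_mean(np.log, t_n) - eigenvalue_mean(np.log, c_n))
    results[n] = (weak_norm(t_n - c_n), gap)
```

Both numbers shrink as n grows, in line with the requirement (18) and the conclusion (19).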

We have found so far that the kn eigenvalues of a block circulant
matrix of the type (8) are grouped according to (14). However, we are
interested in asymptotic expressions for the eigenvalues of covariance matrices
of the type (7) with $R_{-k} = R_k^{T}$. We will therefore approximate the block
symmetric Toeplitz matrix $R_n$ by a block circulant one.


There is no unique approximation by a block circulant. A convenient
one will be chosen next. It is a matrix generalization of the circulant
approximation used by Grenander and Szego [14]. Let $F(\lambda)$ be the spectral
density of the stationary random process in question. We assume that

$$\sup_{\lambda} |F(\lambda)| \le M < +\infty, \qquad 0 < m \le \inf_{\lambda} |F(\lambda)| \qquad (21)$$


Let 


(l-|k|/p) 


L 0 


for  |k|  < p £ n 


Otherwise 


(22) 


(23) 


and 


F - I (l-|k|/p)  - f R,^ 


k-p 


Consider the block-circulant matrix


k«-p 


(24) 


$$L_n = V_n \hat{L}_n V_n^{-1} \qquad (25)$$

where $\hat{L}_n$ is block diagonal, with diagonal blocks:

$$\{\hat{L}_n\}_{mm} = F_p(2\pi m n^{-1}) \qquad (26)$$

and where $V_n$ is given by (9). It is easily shown that:

I <”-•>"  .^(2.3„-h 

j“l  ^ 


(27) 


We  need  to  calculate  the  differences 


[l  - . iR^  - R 

^ n n-*-  n n'*’ 


We  have: 


-1 


m-1 


-1 


m-l 


m 


(28) 


IrJ  - R„l^  1 2n‘^  ^ m^p"^(p-in)iR^l^  + 2n"^  I (n-ni)iRj^  (29) 

in=p+l 


in=*l 


Due  to  (21)  we  have 


in«0 


and  thus  for  a given  c > 0 we  can  pick  a p so  large  that  7 I < c. 

in«p+l  *" 

By  choosing  first  p and  then  n sufficiently  large,  we  can  make  the 
distance  - R^|  sufficiently  small,  l.e. : 


1l  - R i < k ^ [w  n"^  + 2e]^^^ 

n n — 1 z 


(30) 


$L_n$ is Hermitian and bounded, and its nk eigenvalues are the union of the
eigenvalues of the n matrices $\{F_p(2\pi m n^{-1}),\ m = 1, 2, \ldots, n\}$.

Let $h_q(p,u)$ be the qth largest eigenvalue of $F_p(u)$. It can be easily
shown that $h_q(p,u)$ is a continuous function of u. According to Theorem 1,
for any continuously differentiable function g on $[m, M]$, we have:

$$\lim_{n\to\infty} n^{-1}\sum_{m=1}^{n} g[h_q(p, 2\pi m n^{-1})] = (2\pi)^{-1}\int_0^{2\pi} g[h_q(p,u)]\,du \qquad (31)$$

Also,

$$\Big|\int_0^{2\pi} \{g[h_q(p,u)] - g[h_q(u)]\}\,du\Big| \le A \int_0^{2\pi} |h_q(p,u) - h_q(u)|\,du \qquad (32)$$

where $h_q(u)$ is the qth eigenvalue of $F(u)$ and A is a bound on the first
derivative of g. Thus, as $p \to \infty$, the right side of (32) goes to 0.

In conclusion, if $p, n \to \infty$ in the manner prescribed in the development
of (30), we will have:

$$\lim_{n,p\to\infty} n^{-1}\sum_{m=1}^{n} g[h_q(p, 2\pi m n^{-1})] = (2\pi)^{-1}\int_0^{2\pi} g[h_q(u)]\,du \qquad (33)$$

(33) was developed by an asymptotic approximation of the block Toeplitz
matrix $R_n$ by the block circulant matrix $L_n$. Simpler block circulant
approximations to $R_n$ may be developed, along the lines of [15], [16], [17],
but this would require more restrictive conditions on $F(\lambda)$ than (21).

In the following section we will apply equation (33) to the calculation
of the asymptotic expressions given in the abstract.
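The practical content of (33), summed over q, is that the average of g over the kn eigenvalues of $R_n$ (normalized by n) approaches the frequency average of g over the eigenvalues of $F(u)$. The sketch below checks this for a hypothetical two-dimensional moving-average process $x_t = e_t + \Theta e_{t-1}$ with unit-covariance white noise, for which $R_0 = I + \Theta\Theta^T$, $R_1 = \Theta$, and $F(u) = R_0 + R_1 e^{-iu} + R_1^T e^{iu}$. The matrix `THETA`, the sizes, and the function names are assumptions chosen only for illustration.

```python
import numpy as np

THETA = np.array([[0.5, 0.2], [-0.1, 0.4]])   # hypothetical MA(1) coefficient
R0 = np.eye(2) + THETA @ THETA.T              # lag-0 covariance
R1 = THETA                                    # lag-1 covariance E[x_{t+1} x_t^T]

def block_toeplitz(n):
    """Covariance matrix of n consecutive observations (block tridiagonal here)."""
    k = 2
    r = np.zeros((k * n, k * n))
    for j in range(n):
        r[j * k:(j + 1) * k, j * k:(j + 1) * k] = R0
        if j + 1 < n:
            r[(j + 1) * k:(j + 2) * k, j * k:(j + 1) * k] = R1
            r[j * k:(j + 1) * k, (j + 1) * k:(j + 2) * k] = R1.T
    return r

def spectral_density(u):
    return R0 + R1 * np.exp(-1j * u) + R1.T * np.exp(1j * u)

def eigenvalue_average(g, n):
    """n^{-1} times the sum of g over the kn eigenvalues of R_n."""
    return float(np.sum(g(np.linalg.eigvalsh(block_toeplitz(n)))) / n)

def spectral_integral(g, grid=2048):
    """(2 pi)^{-1} times the integral over [0, 2 pi) of the sum over q of g(h_q(u))."""
    u = 2 * np.pi * np.arange(grid) / grid
    total = sum(float(np.sum(g(np.linalg.eigvalsh(spectral_density(x))))) for x in u)
    return total / grid

n = 200
gap_square = abs(eigenvalue_average(np.square, n) - spectral_integral(np.square))
gap_log = abs(eigenvalue_average(np.log, n) - spectral_integral(np.log))
```

With $g = \log$ the left side is $n^{-1}\log|R_n|$, which is exactly the type of term that appears in the Bhattacharyya distance.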


III. ASYMPTOTIC EXPRESSIONS


We now consider two stationary, k-dimensional vector
Gaussian processes. Let $x^n = (x_1 \ldots x_n)$ be a sequence of n vector, zero-mean
observations, and let the corresponding covariance matrices be:

$$R_{nj} = E[x^n (x^n)^T \mid H_j], \quad j = 1, 2 \qquad (34)$$

Each $R_{nj}$ is a block Toeplitz matrix of the form (7), built from the
$k \times k$ covariance blocks $R_{0j}, R_{1j}, \ldots, R_{n-1,j}$.

The Bhattacharyya distance is then expressed as:

$$2B_n = \log|2^{-1}R_{n1} + 2^{-1}R_{n2}| - 2^{-1}\log|R_{n1}| - 2^{-1}\log|R_{n2}| \qquad (35)$$

and the divergence:

$$2J_n = \mathrm{trace}\,[R_{n1}R_{n2}^{-1} + R_{n1}^{-1}R_{n2} - 2I] \qquad (36)$$


For the calculation of B, we observe that:

$$n^{-1}\log|R_{n1}| = n^{-1}\sum_{i=1}^{nk} \log d_i$$

where $\{d_i\}$ are the kn eigenvalues of $R_{n1}$. According to the theory,
they are asymptotically equally distributed to the eigenvalues of a
block circulant approximation $L_n$. Thus,


mean 


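where $V_n$ is the unitary matrix (9), and $\hat{L}_{n1}$, $\hat{L}_{n2}$ are block diagonal
matrices corresponding to the two p-modifications of the spectral densities
as specified by (23), (24). Thus, the mth diagonal block of $\hat{L}_{n1}\hat{L}_{n2}^{-1}$ is

$$\{\hat{L}_{n1}\hat{L}_{n2}^{-1}\}_{mm} = F_{p1}(2\pi m n^{-1})\,F_{p2}^{-1}(2\pi m n^{-1}) \qquad (41)$$

where

$$F_{ps}(\lambda) = \sum_{k=-p}^{p} (1 - |k|/p)\,e^{-ik\lambda}\,R_{ks}, \quad s = 1, 2 \qquad (42)$$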

The kn eigenvalues of $R_{n1}R_{n2}^{-1}$ are thus asymptotically equally distributed
to the union of eigenvalues of the n $k \times k$ matrices

$$\{F_{p1}(2\pi m n^{-1})\,F_{p2}^{-1}(2\pi m n^{-1}),\ m = 1, \ldots, n\}$$

Let $h_q(p, 2\pi m n^{-1})$ be the qth ordered eigenvalue of the matrix
$F_{p1}(2\pi m n^{-1})F_{p2}^{-1}(2\pi m n^{-1})$. Using (33) with $g(x) = x$, we find:

$$\lim_{n,p\to\infty} n^{-1}\sum_{m=1}^{n} h_q(p, 2\pi m n^{-1}) = (2\pi)^{-1}\int_0^{2\pi} h_q(u)\,du \qquad (43)$$

where $h_q(u)$ is the qth ordered eigenvalue of $F_1(u)F_2^{-1}(u)$. Summing over
q, we have:

$$\lim_{n\to\infty} n^{-1}\,\mathrm{trace}\,[R_{n1}R_{n2}^{-1}] = (2\pi)^{-1}\int_0^{2\pi} \mathrm{trace}\,F_1(\lambda)F_2^{-1}(\lambda)\,d\lambda \qquad (44)$$

Collecting terms, we find:

$$2J = (2\pi)^{-1}\int_0^{2\pi} \mathrm{trace}\,[F_1(\lambda)F_2^{-1}(\lambda) + F_2(\lambda)F_1^{-1}(\lambda) - 2I]\,d\lambda \qquad (45)$$

It is interesting that (45) can stand on its own as a distance measure.
(45) has found application in speech processing, for measuring the
distance between two sounds in a subjectively meaningful way [19]. In
[19], the scalar case k = 1 was utilized.
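The spectral expressions can be compared with the finite-n definitions (35) and (36) in the scalar case k = 1. The sketch below assumes two hypothetical processes with covariances $r_k = a^{|k|}$, whose spectral densities are $(1-a^2)/(1 - 2a\cos u + a^2)$; the parameter values, the grid size, and the function names are illustrative assumptions. The spectral form of $2B$ used here is the frequency average of $\log(2^{-1}F_1 + 2^{-1}F_2) - 2^{-1}\log F_1 - 2^{-1}\log F_2$, which is what (35) suggests term by term.

```python
import numpy as np

def ar1_covariance(a, n):
    lags = np.arange(n)
    return a ** np.abs(lags[:, None] - lags[None, :])

def ar1_density(a, u):
    return (1.0 - a**2) / (1.0 - 2.0 * a * np.cos(u) + a**2)

def finite_rates(a1, a2, n):
    """n^{-1} 2B_n from (35) and n^{-1} 2J_n from (36)."""
    r1, r2 = ar1_covariance(a1, n), ar1_covariance(a2, n)
    logdet = lambda m: np.linalg.slogdet(m)[1]
    two_b = logdet(0.5 * (r1 + r2)) - 0.5 * logdet(r1) - 0.5 * logdet(r2)
    two_j = (np.trace(np.linalg.solve(r2, r1))
             + np.trace(np.linalg.solve(r1, r2)) - 2.0 * n)
    return float(two_b / n), float(two_j / n)

def spectral_rates(a1, a2, grid=4096):
    """Frequency averages: the spectral forms of 2B and 2J, the latter as in (45)."""
    u = 2.0 * np.pi * np.arange(grid) / grid
    f1, f2 = ar1_density(a1, u), ar1_density(a2, u)
    two_b = np.mean(np.log(0.5 * (f1 + f2)) - 0.5 * np.log(f1) - 0.5 * np.log(f2))
    two_j = np.mean(f1 / f2 + f2 / f1 - 2.0)
    return float(two_b), float(two_j)

two_b_n, two_j_n = finite_rates(0.5, -0.3, 300)
two_b_lim, two_j_lim = spectral_rates(0.5, -0.3)
```

The spectral forms need only the two densities on a frequency grid, whereas the finite-n forms need determinants and inverses of $n \times n$ matrices.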

Other applications of the two new distance measures are envisioned
in EEG signal analysis [20]-[22]. It is also plausible that a distance
measure of the type (43) or (38) may be a good clustering criterion in
the space of spectral densities. If one wishes to "cluster" EEG's
for the purpose of identifying "disease clusters", (38) and (43) may be
useful measures due to the availability of spectral density estimates
of EEG's [20]. Furthermore, the association of (38), (43) with the
probability of misclassification is an intuitively appealing factor.

It can be easily shown that $J(F_1, F_2)$ is convex in $F_2$ for fixed $F_1$,
while B is neither convex nor concave. This is an advantage in favor
of J. On the other hand, B provides better bounding expressions for the
probability of misclassification than J does.

The geometry of $B_n$ in the space of probability measures has been
analyzed in [23], [24], and several convenient geometrical properties were
established.

A criticism against $B_n$ is that it is not a true distance measure
because it does not satisfy the triangle condition. However, we shall
show that there is a one-to-one correspondence of $B_n$ to a proper
distance measure.

Let

$$H_n = H_n(f_1(x^n), f_2(x^n)) = \Big\{\int [f_1^{1/2}(x^n) - f_2^{1/2}(x^n)]^2\,dx^n\Big\}^{1/2} \qquad (46)$$

$H_n$ is the Hellinger distance [1] between $f_1(x^n)$, $f_2(x^n)$, and obviously
satisfies the triangle condition. Furthermore, expanding the square in (46)
and using (4),

$$\exp(-B_n) = 1 - 2^{-1}H_n^2 \qquad (47)$$

$$|n^{-1}B_n - B| \le c_0\,n^{-3/2} \qquad (48)$$

Thus,

$$\exp(-B_n) = b^n d_n^{-1}$$

where $b = \exp(-B)$, $d_n = \exp(c_0 n^{-1/2})$. Substituting in (47), we have:

$$b^n = d_n\,[1 - 2^{-1}H_n^2] \qquad (49)$$

As $n \to \infty$, $d_n \to 1$, thus

$$b \approx [1 - 2^{-1}H_n^2]^{1/n} \qquad (50)$$

Equations (47), (48) give a one-to-one correspondence between b and $H_n$.
The function relating them is dependent on n, but has a simple structure.
In Figure 1 the functional relationship is drawn.
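The link between the Bhattacharyya distance (4) and the Hellinger distance (46), namely $\exp(-B_n) = 1 - 2^{-1}H_n^2$, follows from expanding the square in (46) and can be confirmed numerically. The sketch below does so for zero-mean scalar Gaussian densities on a fine grid; the variances and grid are illustrative assumptions. It also checks the triangle condition for $H_n$ on three densities.

```python
import numpy as np

X = np.linspace(-80.0, 80.0, 800001)      # integration grid (assumed wide enough)
DX = X[1] - X[0]

def gaussian_pdf(var):
    return np.exp(-X**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def hellinger(f1, f2):
    """H_n of (46), evaluated by a Riemann sum."""
    return float(np.sqrt(np.sum((np.sqrt(f1) - np.sqrt(f2)) ** 2) * DX))

def bhattacharyya(f1, f2):
    """B_n of (4), evaluated by a Riemann sum."""
    return float(-np.log(np.sum(np.sqrt(f1 * f2)) * DX))

f1, f2, f3 = gaussian_pdf(1.0), gaussian_pdf(4.0), gaussian_pdf(9.0)
h12, h23, h13 = hellinger(f1, f2), hellinger(f2, f3), hellinger(f1, f3)
b12 = bhattacharyya(f1, f2)
identity_gap = abs(np.exp(-b12) - (1.0 - 0.5 * h12**2))
```

Because the map from $H_n$ to $B_n$ is monotone, ranking pairs of densities by $B_n$ or by $H_n$ gives the same order, even though only $H_n$ is a metric.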

Experts in the areas of feature selection and clustering techniques
in Statistical Pattern Recognition have observed [25]-[29] that both
methods are computationally expensive and exhibit strange behavior in
high-dimensional measurement spaces. It is envisioned that the following
two-step techniques may alleviate the above problems.

(I) Clustering

(Ia) Estimation of the spectral density $F_i(\lambda)$ from the ith
observation record $\{x_i(k),\ k \ge 0\}$, $i = 1, \ldots, M$.

(Ib) Clustering the M records in the space of the measured
$F_i(\lambda)$, using as distance measures the numbers B or J.

(II) Feature Selection

(IIa) Same as (Ia), for M > 2.

(IIb) Find the linear $k \times s$, $s < k$, transformation A that reduces
the amount of data and maximizes J or B.

The point of view in proposing the above methods is as follows. In
practice, any time series $\{x_i(k),\ 1 \le k \le T\}$ observed for a long time
interval T exhibits statistical dependence only for pairs of samples
that are neighbors in time. In clustering T-dimensional vectors by the
usual techniques, a large T will impose substantial computing resource
requirements. Furthermore, the convergence of clustering procedures may
be in doubt, and the statistical independence between distant samples
is not utilized. It is believed that the idea of fitting models to the
data and then clustering the models according to J or B in a low-dimensional
space may alleviate the stated difficulties.
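Steps (Ia) and (Ib) can be sketched as follows. Everything specific in this example is an assumption made for illustration: the records are simulated scalar autoregressive series, the spectral estimate is a plain averaged periodogram over non-overlapping segments, the divergence is a discrete frequency average in the spirit of (45) for k = 1, and the grouping rule (two seeds, nearest-seed assignment) is a deliberately simple stand-in for a real clustering algorithm.

```python
import numpy as np

def simulate_ar1(a, length, rng):
    """Record from x_t = a x_{t-1} + e_t with unit-variance white noise e_t."""
    e = rng.standard_normal(length)
    x = np.zeros(length)
    for t in range(1, length):
        x[t] = a * x[t - 1] + e[t]
    return x

def averaged_periodogram(x, segment=128):
    """Step (Ia): spectral estimate averaged over non-overlapping segments."""
    usable = (len(x) // segment) * segment
    segments = x[:usable].reshape(-1, segment)
    return np.mean(np.abs(np.fft.rfft(segments, axis=1)) ** 2, axis=0) / segment

def divergence(f1, f2):
    """Discrete approximation of J for k = 1 (half of the average in (45))."""
    return 0.5 * float(np.mean(f1 / f2 + f2 / f1 - 2.0))

def two_cluster_labels(spectra):
    """Step (Ib): seed with record 0 and the record farthest from it in J."""
    m = len(spectra)
    d = np.array([[divergence(spectra[i], spectra[j]) for j in range(m)]
                  for i in range(m)])
    seed0, seed1 = 0, int(np.argmax(d[0]))
    labels = [0 if d[i, seed0] <= d[i, seed1] else 1 for i in range(m)]
    return labels, d

rng = np.random.default_rng(1)
records = ([simulate_ar1(0.8, 4096, rng) for _ in range(3)]
           + [simulate_ar1(-0.8, 4096, rng) for _ in range(3)])
spectra = [averaged_periodogram(x) for x in records]
labels, distances = two_cluster_labels(spectra)
```

Each record of 4096 samples is reduced to 65 spectral values before any distances are computed, which is the dimensionality reduction the paragraph above argues for.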


Figure  1 
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1.  Introduction 

In his classic paper, Hampel (1971) introduced a definition
of robustness in parameter estimation that accurately reflected the
intuitive notion that a sequence of estimates of a parameter was robust for
an observation process $\mu$ if another process $\nu$ that was "close" to
$\mu$ yielded a "close" distribution on the parameter estimates. Hampel
considered memoryless or independent, identically distributed (i.i.d.)
observation processes and measured their "closeness" by the Prohorov
distance on the marginal probability measures. As he considered i.i.d.
processes, his underlying parameter depended implicitly only on these
unknown marginals. Hampel then proved that weakly continuous functionals on
the space of probability distributions defined robust sequences of
estimators under his assumptions. He also showed his results could be
adapted via an alternative notion of robustness to weakly dependent
observations, in particular, observations that were close to memoryless in
a Prohorov sense.

A critical part of his derivation was the fact that if two i.i.d.
processes $\mu$ and $\nu$ are close in a marginal Prohorov sense, then one
could construct a pair process $p$ having $\mu$ and $\nu$ as coordinate
processes and such that under $p$ the sample distributions of two
coordinate n-tuples, $x^n$ produced by $\mu$ and $y^n$ produced by $\nu$,
were close in a Prohorov sense with high probability.

*P. Papantoni-Kazakos completed this work at Rice University, Houston, Texas.
†R. M. Gray is with Stanford University.

‡This research was supported by the Air Force Office of Scientific Research
under AFOSR Grant 77-3156 and AFOSR Contract F44620-73-C-0065.
During the past few years, a generalization of Ornstein's $\bar{d}$ distance
of ergodic theory (called the "rho-bar," or generalized Ornstein distance)
has been shown to provide a similar control for sample distributions for
general stationary and ergodic processes and, largely as a result, has
found numerous applications in information theory (see, e.g., Gray,
Neuhoff and Shields (1975), Gray, Neuhoff, and Omura (1975), Gray, Neuhoff
and Ornstein (1975)). In this paper we show that using the $\bar\rho$
distance as a measure of closeness of the observation processes, there is a
natural qualitative definition of robustness for all stationary ergodic
processes, that a weakened version of Hampel's weak-continuous estimator
sequence implies robustness and that all of Hampel's results have analogs
in this more general case. Our formalism does not quite contain Hampel's in
the case of i.i.d. processes and parameters depending only on the marginal
probabilities, but is a strict generalization in some cases such as when
the metric on the observation alphabet is bounded or when the class of
probability measures considered is constrained to have a finite second
moment (see Lemma 2.1).

We also note that we need not confine estimates to take values in
R as Hampel does, but instead we only require that the parameter alphabet
be a complete, separable metric (Polish) space. Hence function-valued
parameter spaces are allowed.

As a side result, some easy generalizations of the convergence of
sample distributions [Parthasarathy (1967)] for stationary and ergodic
processes are developed.


2.  Preliminaries 


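Let $(\Omega, \mathcal{B}_\Omega)$ be a measurable space such that $\Omega$
is a complete, separable metric space (or Polish space) with metric $\rho$
and $\mathcal{B}_\Omega$ is the Borel $\sigma$-field generated by the open
sets under $\rho$. Since $\Omega$ is separable, there is a countable
collection of sets $\mathcal{G}_\Omega = \{G_i;\ i = 1,2,\ldots\}$ such that
$\mathcal{B}_\Omega = \sigma(\mathcal{G}_\Omega)$, that is,
$\mathcal{B}_\Omega$ is the $\sigma$-field generated by $\mathcal{G}_\Omega$.

Let $\Omega^n$ be the space of n-tuples with coordinates in $\Omega$ and
$\Omega^\infty$ the space of sequences
$\omega = (\ldots,\omega_{-1},\omega_0,\omega_1,\ldots)$, $\omega_i \in \Omega$,
all $i$. Let $\mathcal{B}_\Omega^n$ be the $\sigma$-field of subsets generated
by all rectangles of the form $\prod_{i=0}^{n-1} B_i$, $B_i \in \mathcal{B}_\Omega$
(since $\Omega$ is Polish, $\mathcal{B}_\Omega^n = \sigma(\mathcal{G}_\Omega^n)$,
the $\sigma$-field generated by rectangles with $B_i \in \mathcal{G}_\Omega$).
Let $\mathcal{B}_\Omega^\infty$ be the $\sigma$-field generated by all rectangles
of the form $B = \{\omega: \omega_i \in B_i,\ n \le i \le m\}$,
$B_i \in \mathcal{B}_\Omega$. Let $\mu$ be a probability measure on the
measurable space $(\Omega^\infty, \mathcal{B}_\Omega^\infty)$, yielding a
probability space $(\Omega^\infty, \mathcal{B}_\Omega^\infty, \mu)$. The sequence
of coordinate functions $X_n: \Omega^\infty \to \Omega$ defined by
$X_n(\omega) = \omega_n$, $n = \ldots,-1,0,1,\ldots$, on
$(\Omega^\infty, \mathcal{B}_\Omega^\infty, \mu)$ forms a random process and is
denoted either by $[\Omega, \mu, X]$ to emphasize alphabet $\Omega$, measure
$\mu$, and name $X$, or simply by $\mu$ to emphasize measure, or by $\{X_n\}$
to emphasize name.

Let $T: \Omega^\infty \to \Omega^\infty$ denote the shift transformation
defined by $X_n(T\omega) = X_{n+1}(\omega)$. The process $\mu$ is stationary
if $\mu(TF) = \mu(F)$ for all $F \in \mathcal{B}_\Omega^\infty$. The process is
ergodic if $TF = F$ implies $\mu(F) = 0$ or 1.

Denote $(\omega_0, \ldots, \omega_{n-1})$ by $\omega^n$ and define
$X^n: \Omega^\infty \to \Omega^n$ by
$X^n(\omega) = (X_0(\omega), X_1(\omega), \ldots, X_{n-1}(\omega)) = \omega^n$.
Let $\mu^n$ denote the restriction of $\mu$ to $(\Omega^n, \mathcal{B}_\Omega^n)$,
that is, if $F \in \mathcal{B}_\Omega^n$, then
$\mu^n(F) = \mu((X^n)^{-1}(F)) = \mu(\omega: \omega^n \in F)$.
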
Let $\mathcal{M}_s$ denote the class of all stationary processes with alphabet
$\Omega$ and let $\mathcal{M}_e$ denote the class of all stationary and ergodic
processes with alphabet $\Omega$. To avoid confusion we will often use
different names with different measures, e.g., typical members of
$\mathcal{M}_s$ are $[\Omega, \mu, X]$ and $[\Omega, \nu, Y]$.

A process $[\Omega, \mu, X]$ is said to be i.i.d. if for every rectangle
$B = \prod_{i=0}^{n-1} B_i$ we have $\mu^n(B) = \prod_{i=0}^{n-1} \mu^1(B_i)$.
Let $\mathcal{M}_m$ denote the collection of all i.i.d. or memoryless processes
and note that

$$\mathcal{M}_m \subset \mathcal{M}_e \subset \mathcal{M}_s.$$

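Given two processes $\mu$, $\nu$ the generalized Ornstein distance
or $\bar\rho$ distance between $\mu$ and $\nu$ can be defined as follows: for
$x^n, y^n \in \Omega^n$ set

$$\rho_n(x^n, y^n) = n^{-1} \sum_{i=0}^{n-1} \rho(x_i, y_i)$$

and define $\mathcal{P}(\mu^n, \nu^n)$ as the set of all measures $p^n$ on
$(\Omega^n \times \Omega^n, \mathcal{B}_\Omega^n \times \mathcal{B}_\Omega^n)$
having $\mu^n$ and $\nu^n$ as coordinates, that is,
$p^n(\Omega^n \times F) = \nu^n(F)$, $p^n(F \times \Omega^n) = \mu^n(F)$,
all $F \in \mathcal{B}_\Omega^n$. Define the $n$th order distance

$$\bar\rho_n(\mu^n, \nu^n) = \inf_{p^n \in \mathcal{P}(\mu^n, \nu^n)} E_{p^n} \rho_n \tag{2.1}$$

and the $\bar\rho$ distance by

$$\bar\rho(\mu, \nu) = \sup_n \bar\rho_n(\mu^n, \nu^n). \tag{2.2}$$

If with a slight abuse of notation we also let $X^n$ and $Y^n$ denote
coordinate functions on $\Omega^n \times \Omega^n$ so that if
$z = (x^n, y^n) \in \Omega^n \times \Omega^n$, then $X^n(z) = x^n$,
$Y^n(z) = y^n$, then (2.1) also can be written

$$\bar\rho_n(\mu^n, \nu^n) = \inf_{p^n \in \mathcal{P}(\mu^n, \nu^n)} E_{p^n} \rho_n(X^n, Y^n).$$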
Thus $\bar\rho_n(\mu^n, \nu^n)$ measures the smallest possible expected
"distortion" between $X^n$ and $Y^n$ over all stochastic links preserving
the probabilistic description of each. We note $\bar\rho_n$ is the
Vasershtein distance between the random vectors $X^n$ and $Y^n$ described
by $\mu^n$ and $\nu^n$ [Vasershtein (1969)]. The following are some useful
properties of $\bar\rho$ for later use.
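
To make the infimum over couplings concrete, the sketch below treats the simplest case: n = 1, a real alphabet with distortion |x - y|, and two measures that each place equal mass on the same number of points. These restrictions, the function names, and the numerical values are assumptions chosen for illustration only. In this special case an optimal coupling can be taken to be a one-to-one matching of atoms, and on the real line matching the sorted values is optimal; the code checks this against an exhaustive search and compares with the independent coupling, which also has the right coordinates but a larger expected distortion.

```python
from itertools import permutations


def cost_sorted(xs, ys):
    """Expected distortion of the coupling that matches order statistics."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def cost_bruteforce(xs, ys):
    """Minimum expected distortion over all one-to-one matchings of atoms."""
    n = len(xs)
    return min(sum(abs(a - b) for a, b in zip(xs, perm)) / n
               for perm in permutations(ys))


def cost_independent(xs, ys):
    """Expected distortion under the product (independent) coupling."""
    return sum(abs(a - b) for a in xs for b in ys) / (len(xs) * len(ys))


xs = [0.0, 1.0, 4.0, 6.0]
ys = [5.0, 0.5, 2.0, 7.0]
print(cost_sorted(xs, ys), cost_bruteforce(xs, ys), cost_independent(xs, ys))
```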


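Properties of $\bar\rho$ [Gray et al. (1975)]

(i) $\lim_{n \to \infty} \bar\rho_n(\mu^n, \nu^n)$ exists and equals
$\sup_n \bar\rho_n(\mu^n, \nu^n)$.

(ii) If $\mu$ and $\nu$ are i.i.d., then
$\bar\rho(\mu, \nu) = \bar\rho_1(\mu^1, \nu^1)$.

(iii) $\bar\rho(\mu, \nu) \le \bar\rho(\mu, \tau) + \bar\rho(\tau, \nu)$
(triangle inequality).

(iv) The distance can also be defined as follows: Let $\mathcal{P}_s(\mu, \nu)$
be the collection of all stationary pair processes with coordinate
processes $\mu$ and $\nu$, that is, all measures $p$ on
$(\Omega^\infty \times \Omega^\infty, \mathcal{B}_\Omega^\infty \times \mathcal{B}_\Omega^\infty)$
such that $p(\Omega^\infty \times F) = \nu(F)$, $p(F \times \Omega^\infty) = \mu(F)$,
all $F \in \mathcal{B}_\Omega^\infty$, and $p(TF) = p(F)$, all
$F \in \mathcal{B}_\Omega^\infty \times \mathcal{B}_\Omega^\infty$
(where we use $T$ to denote the shift on $\Omega^\infty \times \Omega^\infty$
as well as on $\Omega^\infty$). In a similar fashion let $\mathcal{P}_e(\mu, \nu)$
denote the class of all stationary and ergodic pair processes with $\mu$
and $\nu$ as coordinates. Define the coordinate functions
$(X_n, Y_n): \Omega^\infty \times \Omega^\infty \to \Omega \times \Omega$ by
$(X_n, Y_n)(x, y) = (X_n(x), Y_n(y)) = (x_n, y_n)$. We have that

$$\bar\rho(\mu, \nu) = \inf_{p \in \mathcal{P}_s(\mu, \nu)} E_p \rho(X_0, Y_0) \tag{2.3a}$$

and if $\mu, \nu \in \mathcal{M}_e$,

$$\bar\rho(\mu, \nu) = \inf_{p \in \mathcal{P}_e(\mu, \nu)} E_p \rho(X_0, Y_0). \tag{2.3b}$$

We note that (2.3b) follows from (2.3a) via the ergodic decomposition
of stationary processes [see Oxtoby (1952) or Rohlin (1949)].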

Another important property of $\bar\rho$ is that it is the closest that
generic (typical, regular) sequences of $\mu$ and $\nu$ (those sequences
whose sample averages converge to expectations of enough functions to
determine the measure) can be made to each other in a limiting $\rho_n$
sense [Gray et al. (1975)]. In the next section we develop a result for
sample distributions similar to that of Hampel and Parthasarathy, since the
existing $\bar\rho$ result is not directly useful here because it involves
a different type of sample average. The basic idea is that $\bar\rho$
closeness of two processes will imply that with high probability the
processes will produce close sample distributions.

Hampel used the Prohorov metric between $\mu^1$ and $\nu^1$ to measure
the distance between i.i.d. processes $\mu$ and $\nu$. We can define a
Prohorov distance between processes using a generalization of Moser,
et al. (1975) and this distance can be easily related to $\bar\rho$ by
using the Strassen-Dudley form for the Prohorov distance [Strassen (1965),
Dudley (1968)]: Define the $n$th order Prohorov distance

$$\Pi_n(\mu^n, \nu^n) = \inf_{p^n \in \mathcal{P}(\mu^n, \nu^n)} \inf\{\gamma: p^n(x^n, y^n: \rho_n(x^n, y^n) > \gamma) \le \gamma\}, \tag{2.4}$$

which is the Prohorov metric between $\mu^n$ and $\nu^n$ with respect to the
metric $\rho_n$ (which generates the product topology), and

$$\Pi(\mu, \nu) = \sup_n \Pi_n(\mu^n, \nu^n). \tag{2.5}$$

It is known [Strassen (1965), Dudley (1968)] that a $p^n$ achieving
the infimum exists. We have immediately using Chebychev's inequality
[as in Dobrushin (1970)] that if $p^n$ achieves $\bar\rho_n$ (i.e.,
$E_{p^n} \rho_n = \bar\rho_n$; in the Appendix it is shown that the infimum
is a minimum for Polish alphabets), then

$$p^n(x^n, y^n: \rho_n(x^n, y^n) > \epsilon) \le E_{p^n} \rho_n / \epsilon = \bar\rho_n(\mu^n, \nu^n) / \epsilon$$

and hence choosing $\epsilon = \bar\rho_n(\mu^n, \nu^n)^{1/2}$ yields

$$p^n(x^n, y^n: \rho_n(x^n, y^n) > \bar\rho_n(\mu^n, \nu^n)^{1/2}) \le \bar\rho_n(\mu^n, \nu^n)^{1/2},$$

whence

$$\Pi_n(\mu^n, \nu^n)^2 \le \bar\rho_n(\mu^n, \nu^n), \qquad \Pi(\mu, \nu)^2 \le \bar\rho(\mu, \nu) \tag{2.6}$$

so that closeness in $\bar\rho$ is stronger than closeness in Prohorov. In
some cases the two distances generate the same topology, however, as
the following easy Lemma shows.
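
The Chebychev step behind (2.6) can be checked numerically for any fixed coupling with finitely many atoms. In the sketch below a coupling is summarized, as an assumption made for illustration, by a list of (distortion, probability) pairs; the function names and the numbers are likewise hypothetical. For such a coupling the level inf{g : P(distortion > g) <= g} never exceeds the square root of the expected distortion, which is the inequality that yields (2.6) once the coupling achieving the minimum expected distortion is used.

```python
def expected_distortion(coupling):
    """E[distortion] for a coupling given as (distortion, probability) pairs."""
    return sum(p * d for d, p in coupling)


def prohorov_level(coupling):
    """inf{g : P(distortion > g) <= g} for a finite coupling.

    The tail probability is a step function of g, so on each step the
    smallest admissible g is max(step start, tail value on that step).
    """
    levels = [0.0] + sorted({d for d, _ in coupling})
    best = float("inf")
    for g in levels:
        tail = sum(p for d, p in coupling if d > g)
        best = min(best, max(g, tail))
    return best


# Hypothetical coupling: mostly small distortion, rarely a large one.
coupling = [(0.0, 0.7), (0.1, 0.2), (2.0, 0.1)]
print(prohorov_level(coupling), expected_distortion(coupling) ** 0.5)
```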


Lemma 2.1

(a) If $\rho$ is bounded, then $\bar\rho_n$ and $\Pi_n$ generate the same
topology (and hence so do $\bar\rho$ and $\Pi$).

(b) Given a class $\mathcal{M}_\rho$ of processes $\mu$ such that there exists
an $a^*$ such that $E_\mu \rho(X_0, a^*)^2 \le \rho^* < \infty$, then
$\bar\rho_n$ and $\Pi_n$ generate the same topology on $\mathcal{M}_\rho$.

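Proof.

(a) Let $\rho_{\max}$ be the largest value of $\rho$; then if $p^n$ yields
$\Pi_n = \Pi_n(\mu^n, \nu^n)$ we have

$$\bar\rho_n(\mu^n, \nu^n) \le E_{p^n} \rho_n \le \Pi_n + \rho_{\max}\, p^n(x^n, y^n: \rho_n(x^n, y^n) > \Pi_n) \le \Pi_n + \Pi_n \rho_{\max} = \Pi_n(1 + \rho_{\max}),$$

and hence small $\Pi_n$ implies small $\bar\rho_n$, which with (2.6)
proves (a).

(b) We have similar to before that

$$\bar\rho_n(\mu^n, \nu^n) \le E_{p^n} \rho_n(X^n, Y^n) \le \Pi_n(\mu^n, \nu^n) + \int_{x^n, y^n:\ \rho_n(x^n, y^n) > \Pi_n(\mu^n, \nu^n)} dp^n(x^n, y^n)\, \rho_n(x^n, y^n).$$

Let $G = \{x^n, y^n: \rho_n(x^n, y^n) > \Pi_n(\mu^n, \nu^n)\}$ and let $1_G$ be
the indicator function for $G$, so that $p^n(G) \le \Pi_n(\mu^n, \nu^n)$.
Since $\rho_n$ is a metric, we have from the triangle inequality and the
Cauchy-Schwarz inequality, with $a^{*n} = (a^*, \ldots, a^*)$,

$$E_{p^n} \rho_n(X^n, Y^n) \le \Pi_n(\mu^n, \nu^n) + E_{p^n} \rho_n(X^n, a^{*n}) 1_G + E_{p^n} \rho_n(Y^n, a^{*n}) 1_G$$

$$\le \Pi_n(\mu^n, \nu^n) + (E_{p^n} \rho_n(X^n, a^{*n})^2)^{1/2} (E_{p^n} 1_G^2)^{1/2} + (E_{p^n} \rho_n(Y^n, a^{*n})^2)^{1/2} (E_{p^n} 1_G^2)^{1/2}$$

$$\le \Pi_n(\mu^n, \nu^n) + 2 \rho^{*1/2}\, p^n(G)^{1/2} \le \Pi_n(\mu^n, \nu^n) + 2 \rho^{*1/2}\, \Pi_n(\mu^n, \nu^n)^{1/2},$$

so that small $\Pi_n$ again implies small $\bar\rho_n$, completing the
proof as before.
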
We use $\bar\rho$ and not $\Pi$ as a distance measure on observation
processes for several reasons, primarily because $\bar\rho$ has several
properties useful for robustness (and other) studies that $\Pi$ does not.
In particular:

(1) The $\sup_n \Pi_n$ need not be achieved in the limit $n \to \infty$ as is
$\sup_n \bar\rho_n$. As a result there is no process definition for
$\Pi = \Pi(\mu, \nu)$ analogous to (2.3). This means there need not exist a
single stationary $p$ such that $p(x, y: \rho_n(x^n, y^n) > \Pi) \le \Pi$ for
all $n$. The $p$ yielding $\bar\rho$, however, guarantees that
$\bar\rho(\mu, \nu) = E_p \rho(X_0, Y_0) = E_p \rho_n(X^n, Y^n)$ and hence via
Chebychev's inequality it is true that
$p(x, y: \rho_n(x^n, y^n) > \bar\rho^{1/2}) \le \bar\rho^{1/2}$ for all $n$.
This uniform bound for all $n$ is crucial to prove robustness.

(2) If $\mu$ and $\nu$ are i.i.d., then $\bar\rho = \bar\rho_1$ and hence
marginal closeness of $\bar\rho_1$ in such a case guarantees process
closeness of $\bar\rho$. The analog is not true for Prohorov, that is, it
is not true for $\mu$, $\nu$ i.i.d. that $\Pi(\mu, \nu) = \Pi_1(\mu^1, \nu^1)$.
It need not even be true that given $\epsilon > 0$ there exists a $\delta$
such that $\Pi_1(\mu^1, \nu^1) < \delta$ implies $\Pi(\mu, \nu) < \epsilon$.
Thus marginal closeness of Prohorov does not ensure process closeness for
i.i.d. processes. As a result, using $\Pi(\mu, \nu)$ as a closeness notion
would not be a strict generalization of Hampel's definition of robustness
for i.i.d. processes.

(3) The $\bar\rho$ distance between processes can often be explicitly
evaluated or bounded (as in the Gaussian case), making it useful in
applications. No general bounds to $\Pi$ (except in terms of $\bar\rho$ via
(2.6)) exist.

(4) It is $\bar\rho$ and not $\Pi$ that allows a simple demonstration that
close processes likely produce sample functions with close sample
distributions (as in the next section). Hampel's Prohorov approach worked
in the i.i.d. case because he was able to produce an i.i.d. pair process
$p$ with the correct coordinates by simply taking the i.i.d. process with
the marginal $p^1$ yielding $\Pi_1(\mu^1, \nu^1)$. If $\mu$ and $\nu$ were
not i.i.d., $p$ constructed in this way would not have $\mu$ and $\nu$ as
coordinates. The $\bar\rho$ avoids this problem since it has an equivalent
definition in terms of processes.

As a final observation, one could also define a Prohorov distance
on processes via

$$\rho_\infty(x, y) = \sum_i 2^{-|i|} \rho(x_i, y_i) / (1 + \rho(x_i, y_i))$$

$$\Pi'(\mu, \nu) = \inf_p \inf\{r: p(x, y: \rho_\infty(x, y) > r) \le r\}.$$

This distance generates the weak topology on the class of processes, but it
is of limited use because it "favors" times near zero in determining the
metric $\rho_\infty$. It is the limiting behavior of
$n^{-1} \sum_{i=0}^{n-1} \rho(x_i, y_i)$ and not that of $\rho_\infty$ that
is important in most applications (such as robustness and problems in
information theory). In particular, small $\bar\rho$ will be seen to force
$n^{-1} \sum_{i=0}^{n-1} \rho(x_i, y_i)$ to be small for all $n$ with high
probability; $\Pi'$ is not "strong" enough to imply this.

Even though we have argued that $\bar\rho$ is the appropriate distance
measure on processes, the Prohorov metric is quite adequate as a
measure of distance of random variables, and hence for many intermediate
steps we will use the weaker Prohorov distance to follow Hampel's basic
approach where possible.


3.  Sample  Distributions 

Hampel (1971) following Parthasarathy (1967) considers only
marginal sample distributions of the following kind: Given an n-tuple
$x^n \in \Omega^n$, define the measure $\mu^1_{x^n}$ on $(\Omega, \mathcal{B}_\Omega)$
by assigning probability $n^{-1}$ to each $x_i$, $i = 0, 1, \ldots, n-1$ (if,
say, $k$ of the $x_i$ are identical, this point gets probability $k/n$).
This assignment gives a measure on $(\Omega, \mathcal{B}_\Omega)$ via

$$\mu^1_{x^n}(F) = \sum_{i:\ x_i \in F} n^{-1}.$$

Parthasarathy (1967) proves that for an i.i.d. process $\mu$,

$$\Pi_1(\mu^1_{x^n}, \mu^1) \to 0 \ \text{as}\ n \to \infty, \quad \mu\text{-a.e.} \tag{3.1}$$

We shall wish to consider more general processes and parameters depending
on the whole process and not just the marginal $\mu^1$. Hence we wish to
estimate more than just the marginal from $x^n$. Given an n-tuple
$x^n \in \Omega^n$, form an estimate of the entire underlying process as
follows: Form the periodic string $x = (\ldots, x^n, x^n, x^n, \ldots)$, that
is, $x_k = x_{k \bmod n}$. Define the measure $\mu_{x^n}$ on
$(\Omega^\infty, \mathcal{B}_\Omega^\infty)$ by placing probability $n^{-1}$ on
each string $T^i x$, $i = 0, 1, \ldots, n-1$ (grouping together identical
strings as before), that is,

$$\mu_{x^n}(F) = \sum_{i:\ T^i x \in F} n^{-1}, \quad \text{all } F \in \mathcal{B}_\Omega^\infty. \tag{3.2}$$

The process is periodic as defined by Parthasarathy (1961) since
$\mu_{x^n}(F \cap T^n F) = \mu_{x^n}(F)$, all $F \in \mathcal{B}_\Omega^\infty$.
It is also easily seen to be stationary from (3.2). Furthermore, if
$TG = G$ and hence $T^i G = G$, then if $T^i x \in G$ for any $i$, then
$T^j x \in G$ for all $j$ and hence $\mu_{x^n}(G) = 0$ or 1 and the process is
ergodic. The process $\mu_{x^n}$ has restrictions $\mu^k_{x^n}$ which assign
measure $n^{-1}$ to each k-tuple obtained by viewing $k$ adjacent symbols
within $x^n$ or an "overlap" k-tuple constructed by
$(x_i, \ldots, x_{n-1}, x_0, \ldots, x_{i+k-1-n})$, $i = n-k+1, \ldots, n-1$. In
particular, $\mu^1_{x^n}$ is the same as the Parthasarathy marginal sample
distribution. Note that only if $k \ll n$ are the sample distributions
$\mu^k_{x^n}$ "trustworthy," but it is in fact the sample distributions
$\mu^k_{x^n}$, $n \ge k$, that will be most important. This raises an
alternate (and more common) approach: given $x^n$, define the restrictions
(and not a process) $\mu^k_{x^n}$ by assigning $(n-k)^{-1}$ to each of the
$(n-k)$ k-tuples within $x^n$. We do not take this approach since (1) it is
useful to have a process implying all the restrictions; (2) it is
convenient to have $n^{-1}$ be the probability of the atoms for all $k$ and
the resulting proofs are simpler; and (3) properties of periodic processes
make it easy to demonstrate that a certain seemingly reasonable conjecture
is in fact false. The two approaches obviously yield identical results for
fixed $k$ and large $n$ since the overlap effects die out as $n \to \infty$.
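
The kth order restrictions of the periodic sample process are easy to compute directly: each of the n cyclic shifts of the observed n-tuple contributes mass 1/n to the k-tuple it starts with, including the "overlap" k-tuples that wrap around the end. The following sketch does this for a finite alphabet; the function name and the sample data are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def periodic_sample_distribution(x, k):
    """kth order restriction of the periodic process built from the n-tuple x.

    Each cyclic window (x[i], ..., x[(i + k - 1) mod n]) gets mass 1/n;
    identical windows are grouped together.
    """
    n = len(x)
    counts = Counter(tuple(x[(i + j) % n] for j in range(k)) for i in range(n))
    return {window: Fraction(c, n) for window, c in counts.items()}


x = [0, 1, 1, 0, 1]
for k in (1, 2, 3):
    print(k, periodic_sample_distribution(x, k))
```

Because every window, wrapped or not, carries the same mass 1/n, the restrictions are automatically consistent: summing the second order distribution over its last coordinate recovers the marginal sample distribution.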

The  main  result  of  this  section  is  the  following  generalization  of 
(3.1). 
Lemma 3.1

If $\mu$ is stationary and ergodic, then for any fixed $k$

$$\lim_{n \to \infty} \Pi_k(\mu^k_{x^n}, \mu^k) = 0, \quad \mu\text{-a.e.} \tag{3.3}$$

If, in addition, there exists a reference letter $a^*$ such that

$$E_\mu \rho(X_0, a^*)^2 \le \rho^* < \infty, \tag{3.4}$$

then for any fixed $k$

$$\lim_{n \to \infty} \bar\rho_k(\mu^k_{x^n}, \mu^k) = 0, \quad \mu\text{-a.e.} \tag{3.5}$$


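Proof.

For any $G \in \mathcal{G}^k_\Omega$ the Birkhoff ergodic theorem states that
with $\mu$-probability one

$$\mu^k_{x^n}(G) = n^{-1} \left\{ \sum_{i=0}^{n-k} 1_G(x_i, \ldots, x_{i+k-1}) + \sum_{i=n-k+1}^{n-1} 1_G(x_i, \ldots, x_{n-1}, x_0, \ldots, x_{i+k-1-n}) \right\} \to \mu^k(G) \ \text{as}\ n \to \infty.$$

Hence since $\mathcal{G}^k_\Omega$ is countable, there is a set
$A \in \mathcal{B}^\infty_\Omega$ such that $\mu(A) = 1$ and if $x \in A$,

$$\mu^k_{x^n}(G) \to \mu^k(G) \ \text{as}\ n \to \infty, \quad \text{all } G \in \mathcal{G}^k_\Omega.$$

Since $\mathcal{G}^k_\Omega$ generates $\mathcal{B}^k_\Omega$, we have from
Billingsley (1968) that for $x \in A$, $\mu^k_{x^n} \to \mu^k$ weakly and
hence (3.3) holds.

If in addition (3.4) holds, let

$$D = \{u^k, v^k: \rho_k(u^k, v^k) > \Pi_k(\mu^k_{x^n}, \mu^k)\},$$

let $p \in \mathcal{P}(\mu^k_{x^n}, \mu^k)$ yield $\Pi_k(\mu^k_{x^n}, \mu^k)$, so
that $p(D) \le \Pi_k(\mu^k_{x^n}, \mu^k)$, and let $U^k$ denote the coordinate
random vector of $p$ corresponding to $\mu^k_{x^n}$ and $V^k$ that
corresponding to $\mu^k$. We then have

$$\bar\rho_k(\mu^k_{x^n}, \mu^k) \le E_p \rho_k(U^k, V^k) \le \Pi_k(\mu^k_{x^n}, \mu^k) + k^{-1} \sum_{i=0}^{k-1} E_p \rho(U_i, a^*) 1_D(U^k, V^k) + k^{-1} \sum_{i=0}^{k-1} E_p \rho(V_i, a^*) 1_D(U^k, V^k)$$

and hence from the Cauchy-Schwarz inequality and the stationarity of
$\mu_{x^n}$ and $\mu$

$$\bar\rho_k(\mu^k_{x^n}, \mu^k) \le \Pi_k(\mu^k_{x^n}, \mu^k) + p(D)^{1/2} \left[ k^{-1} \sum_{i=0}^{k-1} (E_p \rho(U_i, a^*)^2)^{1/2} + k^{-1} \sum_{i=0}^{k-1} (E_p \rho(V_i, a^*)^2)^{1/2} \right]$$

$$\le \Pi_k(\mu^k_{x^n}, \mu^k) + \Pi_k(\mu^k_{x^n}, \mu^k)^{1/2} \left[ (E_p \rho(U_0, a^*)^2)^{1/2} + (E_p \rho(V_0, a^*)^2)^{1/2} \right]$$

$$\le \Pi_k(\mu^k_{x^n}, \mu^k) + \Pi_k(\mu^k_{x^n}, \mu^k)^{1/2} \left[ \left( n^{-1} \sum_{i=0}^{n-1} \rho(x_i, a^*)^2 \right)^{1/2} + \rho^{*1/2} \right].$$

As $n \to \infty$, the sum goes to $E_\mu \rho(X_0, a^*)^2 \le \rho^*$ and
hence with $\mu$ probability one

$$\lim_{n \to \infty} \bar\rho_k(\mu^k_{x^n}, \mu^k) \le \lim_{n \to \infty} \left[ \Pi_k(\mu^k_{x^n}, \mu^k) + 2 \rho^{*1/2}\, \Pi_k(\mu^k_{x^n}, \mu^k)^{1/2} \right] = 0,$$

completing the proof.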
One might hope that a stronger result would hold to the effect that
$\bar\rho(\mu_{x^n}, \mu) \to 0$ or $\Pi(\mu_{x^n}, \mu) \to 0$ $\mu$-a.e. That
$\bar\rho(\mu_{x^n}, \mu) \to 0$ is impossible, however, even for general
finite alphabet processes since in that case with $\rho$ being the Hamming
(discrete) metric, convergence in $\bar\rho$ (in this case called $\bar{d}$
and being Ornstein's distance) implies convergence in entropy [Shields
(1975)], yet periodic processes have entropy zero and hence cannot converge
in $\bar\rho$ to a process with nonzero entropy. Furthermore, in this case
we have seen that $\bar\rho$ and $\Pi$ are equivalent metrics and hence it
is not possible for $\Pi(\mu_{x^n}, \mu) \to 0$ $\mu$-a.e. for nontrivial
processes. Roughly speaking, sample distributions can describe the $k$th
order restrictions of a process to arbitrary accuracy as $n \to \infty$ and
any fixed $k$, but they cannot approximate the $k$th order restrictions for
all $k$ simultaneously, thereby forcing $\bar\rho(\mu_{x^n}, \mu)$ to zero.
This observation leads to some of the definitions generalizing those of
Hampel to stationary ergodic processes.


4.  Sequences  of  Estimators 

A sequence of estimators $\{S_n\}$ is a sequence of measurable mappings
$S_n: \Omega^n \to A$, $n = 1, 2, \ldots$, where the parameter space $A$ is a
Polish space with metric $d$ and $\mathcal{B}_A$ is the Borel $\sigma$-field
of subsets of $A$. Unlike Hampel, we do not consider $S_n$ to depend on its
argument $x^n$ only through $\mu^1_{x^n}$, that is, $S_n(x^n)$ is not assumed
to be invariant under permutations of $x^n$. In addition, $A$ need not be
$\mathbb{R}^k$ with the Euclidean metric as in Hampel, allowing more general
function spaces.

In some cases there will exist a "true" value $S(\mu)$ of the parameter
of the process $\mu$ being estimated by the sequence $\{S_n\}$. Analogous
to a special case considered by Hampel, if $S: \mathcal{M}_e \to A$ is the
mapping giving the "true" parameter, one candidate for the sequence of
estimators is $S_n(x^n) = S(\mu_{x^n})$, the parameter associated with the
periodic process obtained from the sample n-tuple. Examples are the sample
mean ($S_n(x^n) = n^{-1} \sum_{i=0}^{n-1} x_i$) and the sample correlation
($S_n(x^n) = n^{-1} \sum_{i=0}^{n-1} x_{i \bmod n}\, x_{(i+t) \bmod n}$), which
are simply the mean and correlation of the process $\mu_{x^n}$. Certain
results analogous to those of Hampel will be proved for this special case.
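
The two example estimators can be written out directly for a real-valued sample. In this sketch the function names and the data are illustrative assumptions; the correlation uses the cyclic indexing of the periodic process, so the lag wraps around the end of the n-tuple. The last lines show why the estimators are not required to be permutation invariant: reordering the sample leaves the mean unchanged but alters the lag-one correlation.

```python
def sample_mean(x):
    """Mean of the periodic process built from the n-tuple x."""
    return sum(x) / len(x)


def circular_correlation(x, t):
    """Lag-t correlation of the periodic process: indices are taken mod n."""
    n = len(x)
    return sum(x[i % n] * x[(i + t) % n] for i in range(n)) / n


x = [1, 2, 3, 4]
permuted = [1, 3, 2, 4]
print(sample_mean(x), circular_correlation(x, 1))
print(sample_mean(permuted), circular_correlation(permuted, 1))
```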


Definition

(i) A parameter $S: \mathcal{M}_e \to A$ is said to be weakly continuous at
$\mu$ with respect to the $\bar\rho$ distance if given $\epsilon > 0$ there
exists a $\delta > 0$ such that $\bar\rho(\mu, \nu) \le \delta$ implies
$d(S(\mu), S(\nu)) < \epsilon$.

(ii) A parameter $S: \mathcal{M}_e \to A$ is said to be strongly continuous
with respect to the $\bar\rho$ distance if given $\epsilon > 0$ there exists
a positive integer $k$ and a $\delta > 0$ such that if
$\bar\rho_k(\mu^k, \nu^k) \le \delta$, then $d(S(\mu), S(\nu)) < \epsilon$.

(iii) A parameter $S: \mathcal{M}_e \to A$ is said to be, simply, strongly
continuous (or strongly continuous with respect to the Prohorov distance)
if given $\epsilon > 0$ there exists a positive integer $k$ and a
$\delta > 0$ such that if $\Pi_k(\mu^k, \nu^k) \le \delta$, then
$d(S(\mu), S(\nu)) < \epsilon$.

It follows from the properties of the distance that strong continuity
$\Rightarrow$ strong continuity with respect to the $\bar\rho$ distance
$\Rightarrow$ weak continuity with respect to the $\bar\rho$ distance.

The strong notions of continuity are required when considering
sample distributions as there the conditions of $\Pi_k$ or $\bar\rho_k$
being small can be met, while the condition of small $\bar\rho$ in general
cannot.

If under $\mu$ a sequence of estimators $\{S_n\}$ converges in probability
(under $\mu$) to a value $S_\infty(\mu)$, that is, if for all $\epsilon > 0$

$$\lim_{n \to \infty} \mu(x: d(S_n(x^n), S_\infty(\mu)) > \epsilon) = 0, \tag{4.1}$$

then we say $\{S_n\}$ is consistent for $S_\infty(\mu)$ under $\mu$. As
pointed out by Hampel, $S_\infty(\mu)$ need not be the same as the "true"
parameter value $S(\mu)$, but in such a case $S_\infty(\mu)$ might be a
better definition of the "true" parameter given the $S_n$.

A sequence of estimators $\{S_n\}$ on a process $\mu$ induces a family
of probability measures on $(A, \mathcal{B}_A)$ defined by

$$\mu^n S_n^{-1}(F) = \mu^n(S_n^{-1}(F)), \quad \text{all } F \in \mathcal{B}_A. \tag{4.2}$$

Lemma 4.1

If $S: \mathcal{M}_e \to A$ is

(i) a strongly continuous parameter at $\mu \in \mathcal{M}_e$, or

(ii) a strongly continuous parameter at $\mu \in \mathcal{M}_e$ with respect
to the $\bar\rho$ distance and there exists a reference letter in the
sense of (3.4),

then the sequence of estimators $\{S_n\}$ given by $S_n(x^n) = S(\mu_{x^n})$
is consistent for $S$ at $\mu$.

Proof.

(i) Given $\epsilon > 0$, choose $k$, $\delta$ such that
$\Pi_k(\mu^k, \nu^k) \le \delta$ implies $d(S(\mu), S(\nu)) < \epsilon$. From
Lemma 3.1, there is an $n_0$ sufficiently large to ensure that
$\mu(x: \Pi_k(\mu^k_{x^n}, \mu^k) > \delta) \le \epsilon$ for $n \ge n_0$, and
hence

$$\mu(x: d(S_n(x^n), S(\mu)) > \epsilon) \le \mu(x: \Pi_k(\mu^k_{x^n}, \mu^k) > \delta) \le \epsilon, \quad n \ge n_0,$$

completing the proof.

(ii) As in (i) with $\Pi_k$ replaced by $\bar\rho_k$.

Lastly, let $\Pi_d$ denote the Prohorov distance between measures on
$(A, \mathcal{B}_A)$ with respect to the metric $d$.


5.  Robust  Sequences 
Definition.

Given a collection of processes $\mathcal{M} \subset \mathcal{M}_e$, a sequence
of estimators $\{S_n\}$ is robust for $\mathcal{M}$ at a process $\mu$ if
given $\epsilon > 0$ there is a $\delta > 0$ such that for all $n$ and all
processes $\nu \in \mathcal{M}$

(A) $\bar\rho(\mu, \nu) < \delta \Rightarrow \Pi_d(\mu^n S_n^{-1}, \nu^n S_n^{-1}) < \epsilon$.

The definition is intuitively the same as Hampel's: A robust
sequence is one for which close observation processes imply uniformly
(over $n$) close estimate distributions. Hampel defines robustness
only at i.i.d. processes and only for the class $\mathcal{M}_m$ of all
i.i.d. processes. In the case of $\mathcal{M}_m$, the premise of (A) is
equivalent to $\bar\rho(\mu, \nu) = \bar\rho_1(\mu^1, \nu^1)$, the marginal
distance, being small. Since $\Pi_1(\mu^1, \nu^1)^2 \le \bar\rho_1(\mu^1, \nu^1)$,
robustness at an i.i.d. process for $\mathcal{M}_m$ in our sense is slightly
weaker than Hampel's robustness. If $\rho$ is bounded or we add the
constraint to $\mathcal{M}_m$ that there exist a reference letter as in
Lemma 2.1, then for $\mathcal{M}_m$ the two notions for robustness at an
i.i.d. process are equivalent.

The following auxiliary definitions will prove useful.

Definition.

(i) A sequence of estimators $\{S_n\}$ is asymptotically robust for a
collection $\mathcal{M} \subset \mathcal{M}_e$ at $\mu$ if given $\epsilon > 0$
there is a $\delta > 0$ and an $n_0$ such that for all $n \ge n_0$ and
processes $\nu \in \mathcal{M}$, (A) holds true.

(ii) A sequence of estimators $\{S_n\}$ is small sample robust for a
collection $\mathcal{M} \subset \mathcal{M}_e$ at $\mu$ if for any integer
$n_0$ and any $\epsilon > 0$ there is a $\delta > 0$ such that (A) holds
for all $n = 1, 2, \ldots, n_0$.

Lemma 5.1

If a sequence $\{S_n\}$ is both asymptotically robust and small sample
robust for $\mathcal{M}$ at $\mu$, then it is robust for $\mathcal{M}$ at $\mu$.

Proof.

Given $\epsilon > 0$, choose $\delta_1$, $n_0$ such that (A) is satisfied for
$n \ge n_0$ and then $\delta_2$ so that (A) is satisfied for $n \le n_0$, and
set $\delta = \min(\delta_1, \delta_2)$.

The following technical definition is an asymptotic weakened version
of Hampel's condition (B) and will play a similar role.

Definition.

Condition (B) is said to be asymptotically satisfied for a sequence
of estimators $\{S_n\}$ and a process $\mu$ if given $\epsilon > 0$,
$\eta > 0$ there exist positive integers $k$ and $n_0$ and a $\delta > 0$
and for all $n \ge n_0$ a set $F_n \in \mathcal{B}^n_\Omega$ such that

$$\mu^n(F_n) \ge 1 - \eta \tag{5.2}$$

and if $x^n \in F_n$, $y^n \in \Omega^n$, and

$$\Pi_k(\mu^k_{x^n}, \mu^k_{y^n}) \le \delta, \tag{5.3}$$

where $\Pi_k$ is the Prohorov distance with respect to $\rho_k$ as in (2.4),
then

$$d(S_n(x^n), S_n(y^n)) < \epsilon. \tag{5.4}$$

If we forced $n_0 = 1$, then the above condition would be identical
to Hampel's except for the fact that we allow a general $k$ (which may
depend on $\epsilon$ and $\eta$) while he requires $k = 1$. Hence our
condition is weaker (his condition (B) implies ours, but not conversely).
The following is analogous to Hampel's Lemma 1.


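Lemma 5.2

If $\mu \in \mathcal{M}_e$ and $\{S_n\}$ asymptotically satisfy condition (B),
then $\{S_n\}$ is asymptotically robust at $\mu$.

Proof.

Choose $\epsilon$ as in (A). For (B) use the same $\epsilon$, set
$\eta = \epsilon/2$ and let $k$, $\delta_0$, $n_0$, $F_n$ be the promised
objects for $n \ge n_0$. Choose $\delta = \min(\delta_0^4, \epsilon^2/4)$.
The key to the proof is that given $x^n$ and $y^n$, the measure $p'$ on
$\Omega^k \times \Omega^k$ which assigns probability $n^{-1}$ to each pair of
k-tuples $((x_i, \ldots, x_{i+k-1}), (y_i, \ldots, y_{i+k-1}))$,
$i = 0, 1, \ldots, n-k$, and
$((x_i, \ldots, x_{n-1}, x_0, \ldots, x_{i+k-1-n}), (y_i, \ldots, y_{n-1}, y_0, \ldots, y_{i+k-1-n}))$,
$i = n-k+1, \ldots, n-1$, is in $\mathcal{P}(\mu^k_{x^n}, \mu^k_{y^n})$ and hence

$$\bar\rho_k(\mu^k_{x^n}, \mu^k_{y^n}) \le E_{p'} \rho_k = n^{-1} \sum_{i=0}^{n-1} k^{-1} \sum_{j=0}^{k-1} \rho(x_{(i+j) \bmod n}, y_{(i+j) \bmod n}) = n^{-1} \sum_{i=0}^{n-1} \rho(x_i, y_i) = \rho_n(x^n, y^n) \tag{5.5}$$

for all $k$ (and hence $\bar\rho(\mu_{x^n}, \mu_{y^n})$ is small if
$\rho_n(x^n, y^n)$ is).

Let $p$ be the stationary process yielding
$\bar\rho(\mu, \nu) = E_p \rho(X_0, Y_0)$; we have

$$E_p \rho_n(X^n, Y^n) = n^{-1} \sum_{i=0}^{n-1} E_p \rho(X_i, Y_i) = \bar\rho(\mu, \nu) \le \delta \tag{5.6}$$

and hence from Chebychev's inequality and (5.5)

$$p(x, y: \bar\rho_k(\mu^k_{x^n}, \mu^k_{y^n}) > \delta^{1/2}) \le p(x, y: \rho_n(x^n, y^n) > \delta^{1/2}) \le \delta^{1/2},$$

whence, using (2.6) and $\delta^{1/4} \le \delta_0$, $\delta^{1/2} \le \epsilon/2$,

$$p(x, y: \Pi_k(\mu^k_{x^n}, \mu^k_{y^n}) > \delta_0) \le p(x, y: \Pi_k(\mu^k_{x^n}, \mu^k_{y^n}) > \delta^{1/4}) \le \delta^{1/2} \le \epsilon/2$$

and

$$p(x, y: x^n \in F_n,\ \Pi_k(\mu^k_{x^n}, \mu^k_{y^n}) \le \delta_0) \ge 1 - \epsilon/2 - \epsilon/2 = 1 - \epsilon,$$

which from (B) implies that with probability at least $1 - \epsilon$,
$d(S_n(x^n), S_n(y^n)) < \epsilon$ and hence
$\Pi_d(\mu^n S_n^{-1}, \nu^n S_n^{-1}) \le \epsilon$, completing the proof.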
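
The averaging identity used in (5.5) can be confirmed numerically: pairing the cyclic k-windows of two n-tuples position by position and averaging the per-window distortion gives back the per-letter distortion of the whole n-tuples, for every window length k. The sketch below assumes integer samples with distortion |x - y|; the function names and the random test data are illustrative only.

```python
import random


def rho_n(x, y):
    """Per-letter average distortion between two tuples of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cyclic_window(x, i, k):
    """k symbols of the periodic extension of x, starting at position i."""
    n = len(x)
    return [x[(i + j) % n] for j in range(k)]


def paired_window_average(x, y, k):
    """Expected k-window distortion under the coupling that pairs the
    windows of x and y starting at the same position, each with mass 1/n."""
    n = len(x)
    return sum(rho_n(cyclic_window(x, i, k), cyclic_window(y, i, k))
               for i in range(n)) / n


rng = random.Random(1)
x = [rng.randint(0, 9) for _ in range(12)]
y = [rng.randint(0, 9) for _ in range(12)]
for k in (1, 3, 7, 12):
    print(k, paired_window_average(x, y, k), rho_n(x, y))
```

The identity holds because each position of the n-tuple appears in exactly k of the n cyclic windows, which is the reason the periodic construction gives every atom the same mass for all k.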

The following definition is a weakened version of one of Hampel's
corresponding definitions.

Definition.

A sequence of estimators $\{S_n\}$ is continuous at $\mu$ if given
$\epsilon > 0$, there exist positive integers $k$, $n_0$ and a $\delta > 0$
such that if $n, m \ge n_0$, $x^n \in \Omega^n$, $y^m \in \Omega^m$, and

$$\Pi_k(\mu^k_{x^n}, \mu^k) < \delta, \qquad \Pi_k(\mu^k_{y^m}, \mu^k) < \delta, \tag{5.7}$$

then

$$d(S_n(x^n), S_m(y^m)) < \epsilon. \tag{5.8}$$

If a single $k$ works for all $\epsilon$, we say $\{S_n\}$ is continuous of
order $k$ at $\mu$ (or continuous at $\mu^k$).

Hampel's definition of continuity of an estimator sequence is what
we call continuity of order 1 (or at $\mu^1$). Hampel essentially restricts
his estimator sequence to depend only on the marginal properties of the
process. Analogous to our strong continuity of parameters, we allow the
estimator sequence to depend on higher order properties, but for a given
$\epsilon > 0$ there must be a finite $k$ such that matching sample
distributions of order $k$ to the underlying $\mu$ forces the estimators to
match up for long observation sequences.

Analogous to Hampel's special case, if a parameter $S$ taking values in $A$ is
strongly continuous, then the sequence of estimators defined by
$S_n(x^n) = S(\mu_{x^n})$, with $\mu_{x^n}$ the sample distribution of $x^n$, is continuous.

The following lemma is a strict generalization of Hampel's Lemma 2,
since our continuity notion for $\{S_n\}$ is weaker than his.

Lemma 5.3

If $\{S_n\}$ is continuous at $\mu \in \mathcal{M}_e$, then, under $\mu$, $\{S_n\}$ is
consistent for some $S_\infty(\mu)$, that is, for any $\delta > 0$

$$\lim_{n\to\infty} \mu(x: d(S_n(x^n), S_\infty(\mu)) > \delta) = 0.$$


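Proof.

For a sequence $\epsilon_i \downarrow 0$ choose $\delta_i \downarrow 0$ and $n_i \uparrow \infty$
(with corresponding integers $k_i$) such that the continuity condition is fulfilled
for $n, m \ge n_i$ (for each $i$). Define for positive integers $k$, $n$ and $\delta > 0$ the set

$$B_n(k,\delta) = \{x^n: \Pi_k(\mu^k_{x^n}, \mu^k) < \delta\}$$

and note from Lemma 3.1 that for fixed $k$, $\delta$

$$\lim_{n\to\infty} \mu^n(B_n(k,\delta)) = 1. \tag{5.9}$$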


From the continuity condition, if $x^n \in B_n(k_i,\delta_i)$, $y^m \in B_m(k_i,\delta_i)$,
$n, m \ge n_i$, then $d(S_n(x^n), S_m(y^m)) < \epsilon_i$ and hence the set

$$G_i = \bigcup_{n \ge n_i}\ \bigcup_{x^n \in B_n(k_i,\delta_i)} S_n(x^n) \subset A \tag{5.10}$$

has diameter $\mathrm{diam}(G_i) \le 2\epsilon_i$. Defining the set
$S_n(B_n(k_i,\delta_i)) = \bigcup_{x^n \in B_n(k_i,\delta_i)} S_n(x^n)$, (5.10) can also be written

$$G_i = \bigcup_{n \ge n_i} S_n(B_n(k_i,\delta_i)).$$

Note also that since all spaces are Polish, measurability of $S_n$ implies
$S_n(B_n(k,\delta)) \in \mathcal{B}_A$. Define the set


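$$A'_i = \bigcap_{j=1}^{i}\ \bigcup_{n \ge n_j} S_n(B_n(k_j,\delta_j)) \supset \bigcup_{n \ge n_i}\ \bigcap_{j=1}^{i} S_n(B_n(k_j,\delta_j))$$

and let $A_i$ denote the closure of $A'_i$ ($A_i$ will play the role of
Hampel's corresponding sets in our case). The $A_i$ are closed and monotone decreasing
since $A'_i \supset A'_{i+1}$, and $\mathrm{diam}\, A_i \le 2\epsilon_i \downarrow 0$. Furthermore, the sets $A_i$
are nonempty as can be seen as follows: For fixed $i$ and $n \ge n_i$, we
have from (5.9) that

$$\begin{aligned}
\mu(x: S_n(x^n) \in A_i) &\ge \mu(x: S_n(x^n) \in A'_i) \\
&\ge \mu\Big(x: S_n(x^n) \in \bigcap_{j=1}^{i} S_n(B_n(k_j,\delta_j))\Big) \\
&\ge \mu\Big(x: x^n \in \bigcap_{j=1}^{i} B_n(k_j,\delta_j)\Big) \to 1 \quad (n \to \infty)
\end{aligned} \tag{5.11}$$

and hence $A_i$ cannot be empty. Since $A$ is complete and the $A_i$ are
closed, monotone decreasing, and nonempty, with diameters tending to zero, from the Cantor
intersection theorem there exists a single point, say $S_\infty(\mu)$, such that
$A_i \downarrow S_\infty(\mu)$. Coupled with (5.11), this proves the lemma.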

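Corollary 5.1: Given $\{S_n\}$, $\mu$, $S_\infty(\mu)$ as in Lemma 5.3, given $\epsilon > 0$
there exist $\delta$, $k$, $n_0$ such that if $n \ge n_0$ and

$$\Pi_k(\mu^k_{x^n}, \mu^k) < \delta$$

then $d(S_\infty(\mu), S_n(x^n)) < \epsilon$.

Proof.

Using the notation of the previous proof, choose $i$ so large that
$\epsilon \ge 2\epsilon_i$ and set $\delta = \delta_i$, $n_0 = n_i$, $k = k_i$. Then if $n \ge n_i$ and
$\Pi_{k_i}(\mu^{k_i}_{x^n}, \mu^{k_i}) < \delta_i$, we have
$S_n(x^n) \in G_i = \bigcup_{n \ge n_i} S_n(B_n(k_i,\delta_i)) \supset A'_i$. Since $S_\infty(\mu) \in A_i$ and
$\mathrm{diam}\, G_i < 2\epsilon_i$, this implies $d(S_n(x^n), S_\infty(\mu)) < 2\epsilon_i \le \epsilon$, completing the
proof.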
The previous corollary simply makes explicit a fact useful for the
next result that is obvious in Hampel's case.

The following theorem is the main result of this paper and is the
analog to Hampel's theorem for stationary and ergodic processes and the
general sequence of estimators here considered. We show that continuity
of $\{S_n\}$ implies asymptotic robustness, and that continuity of the $S_n$
considered as point functions implies small sample robustness.


Theorem 5.1

Let a sequence of estimators $\{S_n\}$ be such that

(i) $S_n$ is continuous as a point function on $\Omega^n$ for every $n$,
that is, given $n$, $x^n \in \Omega^n$, $\epsilon > 0$, there exists a $\delta = \delta(n, x^n, \epsilon)$
such that $\rho_n(x^n, y^n) \le \delta$ implies $d(S_n(x^n), S_n(y^n)) < \epsilon$.

(ii) $\{S_n\}$ is continuous at $\mu$, $\mu$ stationary and ergodic.

Then $\{S_n\}$ is robust for $\mathcal{M}_e$ at $\mu$.


Comments. Condition (i) might appear different from that of
Hampel since we use $\rho_n(x^n,y^n) = n^{-1}\sum_{i=0}^{n-1}\rho(x_i,y_i)$ and he uses
$\rho_n(x^n,y^n) = \max_i \rho(x_i,y_i)$. These metrics generate the same (product)
topology, however, and hence the notions are equivalent. Recall also
that (ii) is weaker than Hampel's corresponding assumption and the
observation processes are far more general, but that our conclusion is
in general slightly weaker. We also note that for large $n$ our proof
parallels Hampel's by proving condition (B). For small $n$, however,
robustness is proved directly from (i) and our proof is simpler than
Hampel's.
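The equivalence of the two sequence metrics for fixed $n$ rests on the elementary bounds average $\le$ max $\le n \cdot$ average. The following sketch checks these bounds numerically. It is only an illustration: the per-letter metric $\rho(a,b) = |a-b|$ and the function names `rho_avg` and `rho_max` are assumptions made for the example, not part of the paper.

```python
def rho_avg(x, y):
    """Averaged per-letter distance n^{-1} * sum_i |x_i - y_i| (assumed rho = |a - b|)."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be nonempty and of equal length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rho_max(x, y):
    """Maximum per-letter distance max_i |x_i - y_i|, the form used by Hampel."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be nonempty and of equal length")
    return max(abs(a - b) for a, b in zip(x, y))


x = [0.0, 1.0, 2.0, 3.0]
y = [0.5, 1.0, 2.0, 1.0]
avg = rho_avg(x, y)
mx = rho_max(x, y)
# For fixed n: avg <= max <= n * avg, so each metric controls the other.
sandwich_holds = avg <= mx <= len(x) * avg
```

Because the constant $n$ in the upper bound grows with the block length, the equivalence holds for each fixed $n$ only, which is all that condition (i) requires.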




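Proof.

First choose $\epsilon > 0$, $\eta > 0$ for property (B). From Lemma 5.3 and
its corollary there exist $\delta > 0$, $n_1$ and $k$ such that for $n \ge n_1$,

$$\Pi_k(\mu^k_{x^n}, \mu^k) \le 2\delta \;\Rightarrow\; d(S_\infty(\mu), S_n(x^n)) < \epsilon/2. \tag{5.12}$$

From Lemma 3.1 there exists an $n_0 \ge n_1$ so large that if $n \ge n_0$,

$$\mu(x: \Pi_k(\mu^k_{x^n}, \mu^k) > \delta) < \eta.$$

For $n \ge n_0$ define $F_n = \{x^n: \Pi_k(\mu^k_{x^n}, \mu^k) \le \delta\}$. Then
$\mu^n(F_n) \ge 1-\eta$, and if $x^n \in F_n$, $y^n \in \Omega^n$ and
$\Pi_k(\mu^k_{x^n}, \mu^k_{y^n}) < \delta$, we have

$$\Pi_k(\mu^k_{y^n}, \mu^k) \le \Pi_k(\mu^k_{y^n}, \mu^k_{x^n}) + \Pi_k(\mu^k_{x^n}, \mu^k) \le 2\delta$$

and hence from (5.12), $d(S_\infty(\mu), S_n(y^n)) < \epsilon/2$ and therefore

$$d(S_n(x^n), S_n(y^n)) \le d(S_n(x^n), S_\infty(\mu)) + d(S_\infty(\mu), S_n(y^n)) \le \epsilon,$$

proving condition (B) is asymptotically satisfied and hence by Lemma 5.2
$\{S_n\}$ is asymptotically robust at $\mu$.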

We next prove that (i) implies that $\{S_n\}$ is small sample robust
at $\mu$, which by Lemma 5.1 will complete the proof. Given $\epsilon > 0$ as
before and any $n$, there exists from Parthasarathy (1967) Thm. 3.2, Ch. 3,
a compact set $K_n$ such that

$$\mu^n(K_n) > 1-\epsilon/4, \qquad \nu^n(K_n) > 1-\epsilon/4.$$

Since $S_n: \Omega^n \to A$ is continuous, it is uniformly continuous on $K_n$ and hence there is
a $\delta_n$ such that for $x^n, y^n \in K_n$, $\rho_n(x^n,y^n) < \delta_n$ implies
$d(S_n(x^n), S_n(y^n)) < \epsilon$. Choose $\delta$ so small that
$\delta \le \min(\delta_i^2,\ i = 1,\ldots,n_0;\ \epsilon^2/4)$
and let $p \in P_s(\mu,\nu)$ yield $\bar\rho(\mu,\nu) = E_p\,\rho(X_0,Y_0) \le \delta$. We have using the
Chebychev inequality that, for $n \le n_0$,

$$\begin{aligned}
p(x,y: d(S_n(x^n), S_n(y^n)) > \epsilon) &\le \mu^n(K_n^c) + \nu^n(K_n^c) + p(x,y: \rho_n(x^n,y^n) \ge \delta_n) \\
&\le \epsilon/2 + p(x,y: \rho_n(x^n,y^n) \ge \delta^{1/2}) \le \epsilon/2 + \delta^{1/2} \le \epsilon
\end{aligned}$$

and hence

$$\Pi_d(\mu^n S_n^{-1}, \nu^n S_n^{-1}) \le \epsilon,$$

completing the proof.

The only point in the preceding development where ergodicity was
required was in the use of Lemma 3.1 in Lemma 5.3, ensuring that sample
distributions of the process $\mu$ converge to the actual distribution of
$\mu$ in the sense of (3.3). The resulting consistency of $\{S_n\}$ at $\mu$
was then in turn used to prove asymptotic robustness at $\mu$. In particular,
if the process $\mu$ is ergodic, but we allow the processes $\nu$ of
Theorem 5.1 to be stationary but not necessarily ergodic, then the entire
proof goes through as before, giving the following.

Corollary 5.2: Given the conditions of Theorem 5.1, then $\{S_n\}$ is robust
for $\mathcal{M}_s$ at $\mu$.

That  robustness  for  the  class  of  ergodic  processes  implies  robustness 
for  the  class  of  stationary  processes  also  can  be  seen  from  the  ergodic 
decomposition  theorem  of  Rohlin  (1949)  which  states,  roughly,  that  every 
stationary  nonergodic  process  is  a mixture  of  ergodic  processes,  that  is, 
can be viewed as nature first selecting an ergodic process (unknown to
the observer) and then sending a sample function from the ergodic
process. Thus, if $\nu$ is stationary, the observer will actually see
some unknown ergodic component, say $\nu_\theta$, of $\nu$ and hence robustness for
ergodic processes will ensure robustness for stationary nonergodic
processes.
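The mixture picture can be illustrated by a small simulation. The sketch below is an assumption-laden toy, not the paper's construction: the two ergodic components are i.i.d. Bernoulli processes with hypothetical biases 0.2 and 0.8, "nature" picks one of them at random, and the observer sees a sample function from the chosen component only. The function name `sample_mixture` is likewise hypothetical.

```python
import random


def sample_mixture(n, biases=(0.2, 0.8), rng=None):
    """Draw n samples from a stationary, nonergodic mixture of i.i.d. Bernoulli processes.

    Nature first selects a bias theta (the ergodic component, hidden from the
    observer), then emits an i.i.d. Bernoulli(theta) sequence.
    """
    rng = rng or random.Random()
    theta = rng.choice(biases)
    xs = [1 if rng.random() < theta else 0 for _ in range(n)]
    return theta, xs


rng = random.Random(0)
theta, xs = sample_mixture(20000, rng=rng)
sample_mean = sum(xs) / len(xs)
# The sample mean tracks the selected component's mean theta,
# not the mixture mean 0.5 of the overall stationary process.
```

In this toy case the sample average settles near the mean of the component that was actually selected, which is why an estimator that behaves well on every ergodic component also behaves well on the mixture.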

Corollary 5.3: Let $S$ be a parameter with values in $A$ such that $S$ is strongly
continuous at $\mu \in \mathcal{M}_e$ and $S_n(x^n) = S(\mu_{x^n})$ is a continuous mapping
from $\Omega^n$ to $A$. Then $\{S_n\}$ is robust for $\mathcal{M}_e$ at $\mu$.

Note that if $S$ is strongly continuous for all $\mu$, then
$S_n(x^n) = S(\mu_{x^n})$ is automatically continuous as a point function from
(2.6).

Analogous  to  Hampel's  Lemma  3 and  corollary  we  have  the  following. 

Lemma 5.4

If $\{S_n\}$ is robust at $\mu \in \mathcal{M}_e$ and consistent for $S_\infty(\nu)$ at all
$\nu \in \mathcal{M}_e$ in a $\bar\rho$ neighborhood of $\mu$, then $S_\infty(\mu)$ is weakly continuous
at $\mu$.

Proof. 

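Since $\{S_n\}$ is robust, given $\epsilon > 0$ there is a $\delta > 0$ such that
if $\bar\rho(\mu,\nu) < \delta$, then $\Pi_d(\mu^n S_n^{-1}, \nu^n S_n^{-1}) \le \epsilon$ for all $n$. By consistency,
for $\delta$ small enough

$$\lim_{n\to\infty} \mu(x: d(S_n(x^n), S_\infty(\mu)) > \epsilon) = 0$$

$$\lim_{n\to\infty} \nu(y: d(S_n(y^n), S_\infty(\nu)) > \epsilon) = 0$$

and hence if $\alpha_1$ is the measure on $(A, \mathcal{B}_A)$ assigning probability one
to the point $S_\infty(\mu)$ and $\alpha_2$ that assigning probability one to
$S_\infty(\nu)$,

$$\Pi_d(\mu^n S_n^{-1}, \alpha_1) \to 0, \qquad \Pi_d(\nu^n S_n^{-1}, \alpha_2) \to 0 \quad (n \to \infty)$$

and hence $\Pi_d(\alpha_1, \alpha_2) \le \epsilon$. Since $\alpha_1$ and $\alpha_2$ are degenerate, however,
$\Pi_d(\alpha_1, \alpha_2) \le \epsilon$ only if $d(S_\infty(\mu), S_\infty(\nu)) \le \epsilon$, proving the lemma.

Corollary 5.4: If $\{S_n\}$ is robust and continuous for all $\mu \in \mathcal{M}_e$,
then $S_\infty(\mu)$ is weakly continuous at all $\mu$.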
Discussion and Applications


Our approach allows the construction of robust estimators for
parameters included in the $K$th order ($K$ finite, fixed) restriction
$[\Omega^K, \mathcal{B}^K, \mu^K]$ of an ergodic stationary process $[\Omega, \mu, X]$. Such parameters
are the moments of order less than or equal to $K$.

The M-estimate $S_\infty(\mu)$ of a scalar parameter $S$ included in
$[\Omega^K, \mathcal{B}^K, \mu^K]$ will now be the solution (if it exists) of the expression
[Huber (1964), Huber (1972)]

$$\int \psi(x_1, \ldots, x_K, S_\infty(\mu))\, \mu^K(dx_1, \ldots, dx_K) = 0 \tag{6.1}$$


As in the i.i.d. case, the sequence of estimators $\{S_n\}$ defined by
$(S_n: \sum_{i=0}^{n-K} \psi(x_i, \ldots, x_{i+K-1}, S_n) = 0)$ is robust if the solution of (6.1) is
unique and $\psi$ is bounded. In other words, one should look for bounded,
"smooth" functions $\psi$ with zero $\mu^K$ expectation.

For the robust estimation of a location parameter, in particular,
M-estimators, L-estimators or R-estimators can be used again [Huber
(1972)], where the first order restriction $[\Omega^1, \mathcal{B}^1, \mu^1]$ of the ergodic
stationary process $[\Omega, \mu, X]$ is considered. For the M-estimators, we
may use the $K$th restriction $[\Omega^K, \mathcal{B}^K, \mu^K]$ instead and recover the
estimate from the expression:

$$\int \psi(x_1 - S_\infty(\mu), \ldots, x_K - S_\infty(\mu))\, \mu^K(dx_1, \ldots, dx_K) = 0 \tag{6.2}$$
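For the first order case ($K = 1$), the finite-sample version of such an equation can be solved numerically. The sketch below is a minimal illustration under stated assumptions: it uses Huber's clipped-linear $\psi$ with the conventional tuning constant 1.345 (a choice made here, not taken from the paper) and finds the root of $\sum_i \psi(x_i - \theta) = 0$ by bisection. The names `huber_psi` and `m_estimate_location` are hypothetical.

```python
def huber_psi(r, c=1.345):
    """Huber's bounded psi function: the residual r clipped to [-c, c]."""
    return max(-c, min(c, r))


def m_estimate_location(xs, c=1.345, tol=1e-10):
    """Solve sum_i psi(x_i - theta) = 0 for theta by bisection.

    g(theta) = sum_i psi(x_i - theta) is continuous and nonincreasing,
    with g(min xs) >= 0 >= g(max xs), so a root lies in [min xs, max xs].
    """
    if not xs:
        raise ValueError("need at least one observation")
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g = sum(huber_psi(x - mid, c) for x in xs)
        if g > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return (lo + hi) / 2


clean = [-1.0, -0.5, 0.0, 0.5, 1.0]
contaminated = clean + [1000.0]
est_clean = m_estimate_location(clean)
est_contaminated = m_estimate_location(contaminated)
```

Because $\psi$ is bounded, the single gross outlier in `contaminated` can shift the estimate only by a bounded amount, whereas it drags the sample mean far from the bulk of the data. This is the boundedness requirement on $\psi$ stated in the text, seen on a toy sample.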


The  asymptotic  distribution  of  the  estimate  S^(p)  can  be  found  by 
methods  similar  to  the  ones  used  by  Huber  (1964). 


New  estimators  determined  through  new  functionals  of  the  data  may 
be  considered,  where  the  properties  of  the  functionals  may  be  determined 


through  the  conditions  in  Theorem  5.1. 
APPENDIX A: Equations (2.1) and (2.3a) are actually minima. (The
proof is due to P.C. Shields.)

Since $\Omega$ and hence $\Omega^\infty$ are complete, separable metric spaces,
any measure $\mu$ on $(\Omega^\infty, \mathcal{B}^\infty)$ is tight, that is, for any $\epsilon > 0$ there
is a compact set $F$ such that $\mu(F) \ge 1-\epsilon$ [Parthasarathy (1967),
Thm. 3.2, p. 29]. If one has a family of measures such that given $\epsilon$
there is a compact set $F$ such that all members of the family place
measure at least $1-\epsilon$ on $F$, then the family is compact in the weak
topology [Parthasarathy (1967), Thm. 6.7, p. 47]. Given $\mu$, $\nu$ choose
$F$ compact such that $\mu(F) \ge 1-\epsilon/2$, $\nu(F) \ge 1-\epsilon/2$; then if
$p \in P_s(\mu,\nu)$, $p(F \times F) \ge 1-\epsilon$ and $F \times F$ is compact. Thus $P_s(\mu,\nu)$
is compact in the weak topology and a sequence $p_n \in P_s(\mu,\nu)$ such that

$$E_{p_n}\rho(X_0,Y_0) \le \bar\rho(\mu,\nu) + 1/n$$

will have a subsequence, say $p_{n_k}$, that converges in the weak topology
to a limiting $p$. The limit $p \in P_s(\mu,\nu)$ and $E_p\,\rho(X_0,Y_0) = \bar\rho(\mu,\nu)$,
completing the proof. The same argument applied to $(\Omega^n, \mathcal{B}^n)$ shows that
$\rho_n$ is also actually a minimum.
The object of this grant is the analysis and design of decision procedures that
have stable, good performance in statistically ill-defined environments. Such
procedures indicate the way to design powerful receivers for systems whose
statistical behavior cannot be described precisely (due to incomplete
availability of data about the system behavior). Different distance measures
have been studied for use as performance criteria for robust estimates.
Careful evaluation and comparison of these distances was done and their
similarities, advantages and disadvantages were carefully stated. A
thorough study of the work already accomplished (by the author as
well as other investigators) on nonparametric statistical procedures
in the presence of small number of discrete data was done and included
in a book on the use of nonparametric procedures in Communication
Systems. A feature selection problem was studied, when several
distance measures are used as discrimination criteria. A sequential
procedure for clustered data was proposed and analyzed. Hampel's
general qualitative definition of robustness of sequences of
estimators on memoryless observation processes was generalized to
stationary processes. The constructive analysis of robustness completed
by the author is being used now in the performance analysis of communication
networks at Bell Laboratories.